OOM after node is synced to head #1715

Closed
twoeths opened this issue Nov 1, 2020 · 6 comments
twoeths commented Nov 1, 2020

Describe the bug
Node is synced to head:

2020-11-01 01:36:48 [SYNC]             info: Regular Sync: processed up to current slot 635283
2020-11-01 01:36:48 [SYNC]             info: Regular Sync: fetcher returns empty array, finish sync now

Then saw some valid gossip blocks:

2020-11-01 01:37:27 [GOSSIP]        verbose: Incoming gossip block at slot: 635286
...
2020-11-01 01:52:34 [CHAIN]           debug: Block processed slot=635360, root=0xfe797cb4160009147d901b9b49db4871f52124a50bb3aa422a0a527128fbbc94

Two minutes after that, the node crashed due to OOM:

<--- JS stacktrace --->

==== JS stack trace =========================================

    0: ExitFrame [pc: 0x1009288f9]
Security context: 0x21ede0380919 <JSObject>
    1: rebindLeft [0x21ed62d0ecb1] [/Users/tuyennguyen/Projects/workshop/lodestar/node_modules/@chainsafe/persistent-merkle-tree/lib/node.js:~57] [pc=0x376646bdff53](this=0x21ed386566f9 <Node map = 0x21edad3809f9>,0x21ed38668f59 <Node map = 0x21edad3809f9>)
    2: /* anonymous */(aka /* anonymous */) [0x21ed38656e61] [/Users/tuyennguyen/Projects/workshop/lodest...
Expected behavior
Node can stay synced

I noticed that I had previously disabled subscribing to all gossip topics (except block). Now that I've applied #1712, that may be what is causing the issue.

twoeths commented Nov 1, 2020

Right before the OOM happened, there were ERR_QUEUE_THROTTLED errors, which means the node had to regen a lot (see the queue sketch after the log below).

For help, see: https://nodejs.org/en/docs/inspector
(node:21583) UnhandledPromiseRejectionWarning: Error: ERR_QUEUE_THROTTLED
    at JobQueue.enqueueJob (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/util/queue.ts:63:13)
    at QueuedStateRegenerator.getState (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/chain/regen/queued.ts:53:32)
    at BeaconChain.getHeadStateContext (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/chain/chain.ts:108:32)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at BeaconChain.getForkDigest (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/chain/chain.ts:293:21)
    at BeaconReqRespHandler.shouldDisconnectOnStatus (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/sync/reqResp/reqResp.ts:126:31)
    at BeaconReqRespHandler.onStatus (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/sync/reqResp/reqResp.ts:105:9)
    at ReqResp.<anonymous> (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/sync/reqResp/reqResp.ts:88:16)
(node:21583) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 189787)
(node:21583) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:21583) UnhandledPromiseRejectionWarning: Error: ERR_QUEUE_THROTTLED
    at JobQueue.enqueueJob (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/util/queue.ts:63:13)
    at QueuedStateRegenerator.getState (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/chain/regen/queued.ts:53:32)
    at BeaconChain.getHeadStateContext (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/chain/chain.ts:108:32)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at BeaconChain.getForkDigest (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/chain/chain.ts:293:21)
    at BeaconReqRespHandler.shouldDisconnectOnStatus (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/sync/reqResp/reqResp.ts:126:31)
    at BeaconReqRespHandler.onStatus (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/sync/reqResp/reqResp.ts:105:9)
    at ReqResp.<anonymous> (/Users/tuyennguyen/Projects/workshop/lodestar/packages/lodestar/src/sync/reqResp/reqResp.ts:88:16)
(node:21583) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 189789)
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

Writing Node.js report to file: report.20201101.015949.21583.0.001.json
Node.js report completed
 1: 0x100078c35 node::Abort() [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
 2: 0x100078db9 node::OnFatalError(char const*, char const*) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
 3: 0x10016e8d7 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
 4: 0x10016e873 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
 5: 0x1002f2605 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
 6: 0x1002f3d2d v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
 7: 0x1002f0ae6 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
 8: 0x1002eea3c v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
 9: 0x1002fa114 v8::internal::Heap::AllocateRawWithLightRetry(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
10: 0x1002fa18f v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
11: 0x1002c8737 v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
12: 0x1005e8f67 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
13: 0x1009288f9 Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit [/Users/tuyennguyen/.nvm/versions/node/v12.10.0/bin/node]
14: 0x376646bdff53
15: 0x3766468911b1
Abort trap: 6
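ERR_QUEUE_THROTTLED in the trace comes from JobQueue.enqueueJob rejecting work once the regen queue is full. A minimal sketch of that throttling behavior, with illustrative names and limits (not Lodestar's actual JobQueue):

```ts
// Hypothetical sketch of a bounded, serial job queue; names are illustrative.
type Job<T> = () => Promise<T>;

class BoundedJobQueue {
  private queue: Array<() => void> = [];
  private running = false;

  constructor(private readonly maxLength: number) {}

  enqueueJob<T>(job: Job<T>): Promise<T> {
    // Throttle instead of queueing unboundedly; this rejection is what
    // surfaces as ERR_QUEUE_THROTTLED in the log above.
    if (this.queue.length >= this.maxLength) {
      return Promise.reject(new Error("ERR_QUEUE_THROTTLED"));
    }
    return new Promise<T>((resolve, reject) => {
      this.queue.push(() => {
        void job().then(resolve, reject).finally(() => this.runNext());
      });
      if (!this.running) this.runNext();
    });
  }

  private runNext(): void {
    const next = this.queue.shift();
    this.running = next !== undefined;
    if (next) next();
  }
}
```

Throttling keeps the queue itself bounded, but each rejected Status request here also became an unhandled promise rejection, as the warnings above show.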

twoeths self-assigned this Nov 1, 2020
twoeths commented Nov 2, 2020

Although I synced to head, I had disabled signature verification, and hence the node did not process the attestations in blocks. Maybe that's why my chain head is not close to the latest block (the latest one is 615614, which is out of date).

I should process block attestations and try again ...

twoeths commented Nov 3, 2020

The latest log shows that we didn't regen a lot, and all the crashes look like this:

<--- Last few GCs --->

[38328:0x10284d000] 36294882 ms: Mark-sweep 1945.2 (2068.3) -> 1930.1 (2068.1) MB, 1262.2 / 0.0 ms  (average mu = 0.121, current mu = 0.025) allocation failure scavenge might not succeed
[38328:0x10284d000] 36296144 ms: Mark-sweep 1946.3 (2069.0) -> 1931.0 (2068.9) MB, 1229.7 / 0.0 ms  (average mu = 0.076, current mu = 0.026) allocation failure scavenge might not succeed

<--- JS stacktrace --->

It looks very similar to nodejs/help#2388, which suggests we logged too much and the issue comes from the logging itself.

I'd go with minimizing the AggregateAndProof gossip validation logging and trying again (the latest log file I got is more than 10GB!!); see the sketch below.
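One way to minimize that chatter is to sample repeated log lines instead of emitting one per gossip message. A hypothetical helper along those lines (not Lodestar code; names are illustrative):

```ts
// Hypothetical sampled logger: emit at most one line per key per interval,
// instead of one line per AggregateAndProof validation.
class SampledLogger {
  private lastLogged = new Map<string, number>();

  constructor(private readonly intervalMs: number) {}

  verbose(key: string, message: string): void {
    const now = Date.now();
    const last = this.lastLogged.get(key) ?? 0;
    // Drop repeats within the sampling window to keep the log file small.
    if (now - last < this.intervalMs) return;
    this.lastLogged.set(key, now);
    console.log(`${new Date(now).toISOString()} ${message}`);
  }
}

// Usage: at most one such line every 30s instead of one per gossip message.
const log = new SampledLogger(30_000);
log.verbose("agg-proof-valid", "[GOSSIP] verbose: valid AggregateAndProof");
```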

twoeths commented Nov 3, 2020

I have the same issue again; Mark-sweep still shows our memory at around 2GB.
So I think last time I didn't see issues only because the non-finality period was not as long as the current one.
@ChainSafe/eth2-0 we may need to increase heap memory, e.g. node --max-old-space-size=3072 index.js (3GB)?
I already reduced MAX_EPOCHS to just 10, so I don't want to reduce it further.
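The Mark-sweep lines above cap out near 2068 MB, consistent with a default V8 heap limit of roughly 2GB. To confirm that --max-old-space-size=3072 actually took effect, the configured limit can be read at startup (a generic Node.js snippet, not Lodestar code):

```ts
import {getHeapStatistics} from "v8";

// Print V8's configured heap limit; with --max-old-space-size=3072 this
// should report roughly 3072 MB (plus a small fixed overhead).
const {heap_size_limit} = getHeapStatistics();
console.log(`V8 heap limit: ${(heap_size_limit / 1024 / 1024).toFixed(0)} MB`);
```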

mpetrunic (Member) commented:

> I already reduced MAX_EPOCHS to just 10, so I don't want to reduce it further.

I think Lighthouse is running MAX_EPOCHS 2 🤣
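Assuming MAX_EPOCHS bounds how many epochs of beacon states are kept cached (an assumption about its role here), the memory/regen tradeoff looks roughly like this sketch (illustrative, not Lodestar's implementation):

```ts
// Sketch of an epoch-bounded state cache; MAX_EPOCHS and all names are
// illustrative assumptions, not Lodestar's actual code.
const MAX_EPOCHS = 10;

class StateCache<State> {
  private byEpoch = new Map<number, State>();

  add(epoch: number, state: State): void {
    this.byEpoch.set(epoch, state);
    // Prune epochs older than the retention window: fewer retained
    // epochs means less heap, at the cost of more frequent regen.
    const cutoff = epoch - MAX_EPOCHS;
    for (const e of this.byEpoch.keys()) {
      if (e < cutoff) this.byEpoch.delete(e);
    }
  }

  get(epoch: number): State | undefined {
    return this.byEpoch.get(epoch);
  }
}
```

Fewer retained epochs means a smaller heap but more frequent state regen, which is the tension between the two comments above.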

twoeths commented Nov 19, 2020

I haven't seen the OOM recently (after getting through the non-finality period and staying synced) when I run with node --max-old-space-size=3072.

twoeths closed this as completed Nov 19, 2020