fix: batch pending parent root fetches to avoid 300+ sequential round-trips#695
ch4r10t33r wants to merge 5 commits into `main`
Conversation
When a block arrives with a missing parent, the old code immediately sent an individual `blocks_by_root` request for that single parent root. A syncing peer walking a long parent chain (e.g. 300 slots back) would therefore open 300+ separate libp2p streams, one per ancestor, flooding both sides with individual round-trips.

Replace the immediate fire-and-forget with a deferred queue:

- Add `pending_parent_roots: AutoHashMap(Root, depth)` to BeamNode.
- `cacheBlockAndFetchParent` now enqueues the missing parent root instead of calling `fetchBlockByRoots` directly.
- `flushPendingParentFetches` drains the map and issues a single batched `blocks_by_root` request for all accumulated roots.
- The flush is called at every natural exit point: after the missing-parent early-return in `onGossip`, at the end of `handleGossipProcessingResult`, and at the end of `processBlockByRootChunk`.

When multiple gossip blocks arrive in the same burst with the same missing ancestor, all their parent roots are now collected and sent as one request instead of N separate requests.
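The queue-and-flush shape described above could look roughly like this — a minimal sketch with hypothetical names (`ParentFetchQueue`, `Root`), assuming Zig 0.15 collection APIs; the real `BeamNode` fields and reqresp plumbing differ:

```zig
const std = @import("std");

const Root = [32]u8;

// Hypothetical sketch of the deferred parent-fetch queue.
const ParentFetchQueue = struct {
    allocator: std.mem.Allocator,
    // root -> how many ancestors we may still need below it
    pending: std.AutoHashMapUnmanaged(Root, u64) = .{},

    /// Called instead of firing an immediate blocks_by_root request.
    fn enqueue(self: *ParentFetchQueue, root: Root, depth: u64) !void {
        const gop = try self.pending.getOrPut(self.allocator, root);
        // Keep the deepest depth seen for a root enqueued twice.
        if (!gop.found_existing or gop.value_ptr.* < depth)
            gop.value_ptr.* = depth;
    }

    /// Drains the map into one batched request worth of roots and
    /// returns the maximum depth across them.
    fn flush(self: *ParentFetchQueue, roots: *std.ArrayList(Root)) !u64 {
        var max_depth: u64 = 0;
        var it = self.pending.iterator();
        while (it.next()) |entry| {
            try roots.append(self.allocator, entry.key_ptr.*);
            max_depth = @max(max_depth, entry.value_ptr.*);
        }
        self.pending.clearRetainingCapacity();
        return max_depth;
    }
};
```

On flush, the drained roots and the single `max_depth` would feed one batched `blocks_by_root` request, which is also where the "max depth across all roots" conservatism discussed below comes from.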
zclawz left a comment
Good optimization — clear problem, clean solution. A few observations:
1. `roots.deinit(self.allocator)` — likely a bug
`std.ArrayList` stores its allocator internally, so `deinit()` takes no arguments. This should be `roots.deinit()` (no parameter). With the extra argument this won't compile.
If you intended an unmanaged list, use `std.ArrayListUnmanaged` + `deinit(allocator)`. But since `self.allocator` is already captured at `initCapacity`, plain `ArrayList` + `deinit()` is the right choice here.
2. `max_depth` across all batched roots
Using the maximum depth across all pending roots for the single batched request is conservative but correct — the worst case is we request a few more ancestors than strictly needed for shallower roots. Fine as a starting point; could be refined later if needed.
3. `FetchFailed` → `CachingFailed` consolidation
Replacing the now-removed `FetchFailed` with `CachingFailed` (for the `put` allocation-failure path) is reasonable — an allocation failure during enqueue is effectively a caching failure. Callers that handled `FetchFailed` may need updating if they had separate recovery logic, but since this is an internal error type it's unlikely to matter.
4. Flush coverage looks complete
The four flush sites (gossip early-return, `handleGossipProcessingResult`, `processBlockByRootChunk` end, and the RPC response handler) cover the natural exit points. Roots that accumulate on error paths will be flushed on the next message cycle, which is acceptable.
Fix the deinit call and this looks good to merge.
On point 1: this project targets Zig 0.15, where `deinit` takes the allocator:

```zig
// Zig 0.15 stdlib — std/array_list.zig
pub fn deinit(self: *Self, gpa: Allocator) void {
```

Without the argument it fails to compile.
The Rust networking thread (spawned by `create_and_run_network`) calls the `export fn` callbacks directly, while the main libxev thread runs `onInterval`. Both access shared state (the `fetched_blocks` cache, chain, fork choice, etc.) concurrently with no synchronisation, causing heap corruption that manifests as an integer overflow in `Bitlist.serializedSize` when a corrupted length field is passed to `ssz.serialize`.

Fix: add `EthLibp2p.state_mutex` and acquire it at the top of every `export fn` callback. The main-thread `onInterval` also acquires the same mutex via the `BeamNode.state_mutex` pointer (null in unit tests, set to `&EthLibp2p.state_mutex` in production). This serialises all access to shared state between the two threads and eliminates the corruption.

The TODO comment about "scheduling on the loop" explains the original intent: all network events should be dispatched on the libxev main thread rather than directly on the Rust thread. The mutex is a simpler intermediate fix that is fully correct; loop-based dispatch can be added later using `xev.Async`.
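The `usize` return type is what turns this corruption into a panic rather than a recoverable error. A minimal sketch of the distinction — the size computation here is hypothetical, not the real ssz code:

```zig
const std = @import("std");

// Returning plain usize: an overflow is a safety-checked panic
// ("integer overflow") in safe builds, not an error the caller can catch.
fn sizeUnchecked(bit_len: usize) usize {
    return bit_len * 8; // traps if bit_len holds GPA poison like 0xAAAA...
}

// Returning !usize: overflow becomes error.Overflow and can propagate.
fn sizeChecked(bit_len: usize) !usize {
    return std.math.mul(usize, bit_len, 8);
}

test "poisoned length is catchable only with an error union" {
    const poison: usize = 0xAAAAAAAAAAAAAAAA;
    try std.testing.expectError(error.Overflow, sizeChecked(poison));
    _ = sizeUnchecked(1024); // fine for sane lengths
}
```

Because the real `serializedSize` has the first shape, the only defence is to prevent the torn read in the first place, which is what the mutex does.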
Can we keep the lock only on the shared-state access code path? The current approach doesn't look good.
I’m not in favor of locking the state. A better approach may be to lock only the DB for writes and buffer any new writes until later, while still serving network calls. Locking state in order to serve network requests introduces a large security vector. I don’t think even DB-level locking should be necessary for reads. Block sync operates on blocks after finalization, so that data should already be immutable.
Previously the lock was acquired at the top of every `export fn`, covering frame parsing, snappy decompression, SSZ deserialisation, and logging — none of which touch shared node state. Move each lock acquisition to immediately before the first actual shared-state access:

- `handleMsgFromRustBridge`: lock just before `gossipHandler.onGossip()`
- `handleRPCRequestFromRustBridge`: lock just before `reqrespHandler.onReqRespRequest()`
- `handleRPCResponseFromRustBridge`: lock just before `rpcCallbacks.getPtr()`
- `handleRPCEndOfStreamFromRustBridge`: lock just before `rpcCallbacks.fetchRemove()`
- `handleRPCErrorFromRustBridge`: lock just before `rpcCallbacks.fetchRemove()`
- peer handlers: lock just before `peerEventHandler.on*()` calls

Read-only setup work (`std.mem.span`, enum casts, `node_registry` lookups, logging) now runs outside the lock, which is safe because:

- The GPA allocator is thread-safe in Zig 0.15 multi-threaded builds.
- `node_registry` is `*const` and immutable after init.
- Frame / snappy / SSZ decode creates new heap objects, not shared state.
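As a self-contained sketch of that discipline — all names here are hypothetical stand-ins, not the real handlers in `ethlibp2p.zig`:

```zig
const std = @import("std");

// Minimal model: decode outside the lock, lock only around the
// first shared-state access.
const Node = struct {
    state_mutex: std.Thread.Mutex = .{},
    shared_counter: u64 = 0, // stand-in for gossipHandler / chain state

    fn onGossip(self: *Node, decoded: u64) void {
        self.shared_counter += decoded;
    }
};

fn handleMsgFromBridge(node: *Node, raw: []const u8) void {
    // Setup work outside the lock: parsing produces a fresh value and
    // touches no shared state, so holding the mutex here would only
    // serialise work that is already thread-safe.
    const decoded = std.fmt.parseInt(u64, raw, 10) catch return;

    // Lock immediately before the first shared-state access.
    node.state_mutex.lock();
    defer node.state_mutex.unlock();
    node.onGossip(decoded);
}
```

The critical section shrinks from "entire handler" to "one method call", which is what keeps the latency impact of the mutex small.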
@GrapeBaBa Fixed in the latest commit. The lock is now acquired immediately before the first call that touches shared node state in each handler, and released right after:
The setup work (frame parsing, snappy decompression, SSZ deserialisation, span/enum casts, logging) all runs outside the lock. It is safe without the lock because the GPA allocator is thread-safe in Zig 0.15 multi-threaded builds.
@anshalshukla Thanks for the detailed feedback. A few points:

On "buffer writes and lock only the DB": that is exactly the long-term loop-dispatch direction (`xev.Async`) described in the summary — buffer the event and process it on the main thread later. The mutex is the narrow immediate fix.

On "block sync operates on finalized, immutable data": the race is not on finalized state. The crash happens in `Bitlist.serializedSize`, reading in-memory structures while another thread is writing them.

On "DB-level locking": the DB is fine; RocksDB is thread-safe. The issue is the in-memory Zig structures (`fetched_blocks`, chain, fork-choice state).

On "security vector": happy to understand this concern better if you can expand on it. A mutex that serialises two threads accessing the same in-memory data structures is a standard correctness tool, not a security concern as far as I can see. The critical sections are narrow (we narrowed them further per GrapeBaBa's comment), so latency impact is minimal.
Summary
- Adds `EthLibp2p.state_mutex` and locks it at the top of every `export fn` callback called from the Rust networking thread
- Adds `BeamNode.state_mutex` (a pointer to the same mutex, null in unit tests) and locks it at the top of `onInterval` on the main libxev thread
- Batches pending parent-root fetches into a single `blocks_by_root` request to avoid sequential round-trips
- Calls `flushPendingParentFetches` in an early-return path of `processBlockByRootChunk`

Root cause of the `integer overflow` crash

The node was crashing with:
Two threads, one shared state, no synchronisation
`create_and_run_network` spawns a Rust/Tokio thread. Every network event — gossip messages, RPC responses, peer connect/disconnect — fires one of the `export fn` handlers in `ethlibp2p.zig` on that Rust thread. Those handlers call directly into Zig node state: `onGossip`, `processBlockByRootChunk`, `cacheBlockAndFetchParent`, `chain.onBlock`, etc.

At the same time, the main libxev thread runs `onInterval` on a timer and calls `processCachedDescendants`, which reads from the same `fetched_blocks` HashMap and walks the same chain / fork-choice state.

There was already a TODO comment in the code documenting the original intent to schedule these callbacks on the libxev loop instead:

How the corruption produces an integer overflow
`sszClone` (used when caching a fetched block) deserialises into a freshly-allocated `SignedBlockWithAttestation`. It writes struct fields sequentially — `Bitlist.length` first, then the `inner` ArrayList. If the main thread observes the struct between those two stores — or if the `fetched_blocks` HashMap is modified while being iterated on the main thread — it reads a torn state:

- `length` has been written (or contains `0xAAAAAAAAAAAAAAAA` from GPA debug poisoning of a concurrently freed slot)
- `inner` is not yet consistent

That torn `length` then flows into `serializedSize`, which returns `usize` (not `!usize`), so Zig cannot propagate an error — it panics at the arithmetic.

The fix
Add `std.Thread.Mutex state_mutex` to `EthLibp2p`. Every `export fn` acquires it before touching any Zig state. `BeamNode.onInterval` acquires the same mutex via a pointer stored in `BeamNode.state_mutex` (set to `&network.state_mutex` in production, null in unit tests so test setup needs no changes).

This serialises all shared-state access between the two threads and eliminates the corruption. The Rust thread holds the lock only for the duration of processing one network event (typically microseconds), so there is no meaningful throughput impact.
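The wiring could be sketched like this (field and type names per the description above; the optional-pointer handling is an assumption about how the null-in-tests case is expressed):

```zig
const std = @import("std");

const EthLibp2p = struct {
    // Acquired at the top of every export fn on the Rust thread.
    state_mutex: std.Thread.Mutex = .{},
};

const BeamNode = struct {
    // Null in unit tests, so test setup needs no changes;
    // set to &network.state_mutex in production.
    state_mutex: ?*std.Thread.Mutex = null,

    fn onInterval(self: *BeamNode) void {
        if (self.state_mutex) |m| m.lock();
        defer if (self.state_mutex) |m| m.unlock();
        // ... processCachedDescendants(), fork-choice walk, etc.
    }
};
```

Because both threads go through the same `std.Thread.Mutex` instance, every shared-state access is serialised; the unit-test path simply skips locking when no network thread exists.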
The long-term fix is to dispatch all network events through `xev.Async` onto the main libxev thread — which is what the existing `scheduleOnLoop` infrastructure was designed for but never fully wired up. The mutex is the correct immediate solution; `xev.Async`-based dispatch can be added as a follow-up.

Test plan
- `zig build` — clean build
- `zig build test --summary all` — all tests pass
- `zig fmt --check` on modified files — no formatting issues