fix: batch pending parent root fetches to avoid 300+ sequential round-trips#695

Open
ch4r10t33r wants to merge 5 commits into main from fix/batch-parent-fetches

Conversation

@ch4r10t33r (Contributor) commented Mar 25, 2026:

Summary

  • Adds EthLibp2p.state_mutex and locks it at the top of every export fn callback called from the Rust networking thread
  • Adds BeamNode.state_mutex (a pointer to the same mutex, null in unit-tests) and locks it at the top of onInterval on the main libxev thread
  • Batches pending parent-root fetches into a single blocks_by_root request to avoid sequential round-trips
  • Fixes a missed flushPendingParentFetches in an early-return path of processBlockByRootChunk

Root cause of the integer overflow crash

The node was crashing with:

thread 1 panic: integer overflow
ssz/utils.zig  Bitlist.serializedSize
ssz/lib.zig    serialize
rocksdb.zig    putToBatch
node.zig       processCachedDescendants
node.zig       onInterval
io_uring.zig   run   ← main libxev thread

Two threads, one shared state, no synchronisation

create_and_run_network spawns a Rust/Tokio thread. Every network event — gossip messages, RPC responses, peer connect/disconnect — fires one of the export fn handlers in ethlibp2p.zig on that Rust thread. Those handlers call directly into Zig node state: onGossip, processBlockByRootChunk, cacheBlockAndFetchParent, chain.onBlock, etc.

At the same time, the main libxev thread runs onInterval on a timer and calls processCachedDescendants, which reads from the same fetched_blocks HashMap and walks the same chain / fork-choice state.

There was already a TODO comment in the code documenting the original intent to schedule these callbacks on the libxev loop instead:

// TODO: figure out why scheduling on the loop is not working
zigHandler.gossipHandler.onGossip(&message, sender_peer_id_slice, false) catch |e| { ... };

How the corruption produces an integer overflow

sszClone (used when caching a fetched block) deserialises into a freshly-allocated SignedBlockWithAttestation. It writes struct fields sequentially — Bitlist.length first, then the inner ArrayList. If the main thread observes the struct between those two stores — or if the fetched_blocks HashMap is modified while being iterated on the main thread — it reads a torn state:

  • length has been written (or contains 0xAAAAAAAAAAAAAAAA from GPA debug poisoning of a concurrently freed slot)
  • inner is not yet consistent

That torn length then flows into:

// Bitlist.serializedSize in ssz/utils.zig
pub fn serializedSize(self: *const Self) usize {
    return (self.length + 7 + 1) / 8;  // overflows when length ≈ usize::MAX
}

serializedSize returns usize (not !usize), so Zig cannot propagate an error — it panics at the arithmetic.
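The failure mode can be reproduced in isolation. The following is a hypothetical minimal repro (not code from this repo) that feeds a torn length through the same expression; wrapping addition is used so the test can observe the wrap instead of panicking:

```zig
const std = @import("std");

test "torn Bitlist length overflows serializedSize arithmetic" {
    // A torn read can yield a length near usize::MAX.
    const torn_length: usize = std.math.maxInt(usize) - 3;
    // Same expression as Bitlist.serializedSize: (length + 7 + 1) / 8.
    // In Debug/ReleaseSafe builds, `torn_length + 8` panics with
    // "integer overflow" — exactly the crash in the stack trace above.
    const wrapped = (torn_length +% 8) / 8;
    try std.testing.expectEqual(@as(usize, 0), wrapped); // wrapped past zero
}
```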

The fix

Add std.Thread.Mutex state_mutex to EthLibp2p. Every export fn acquires it before touching any Zig state. BeamNode.onInterval acquires the same mutex via a pointer stored in BeamNode.state_mutex (set to &network.state_mutex in production, null in unit-tests so test setup needs no changes).

This serialises all shared-state access between the two threads and eliminates the corruption. The Rust thread holds the lock only for the duration of processing one network event (typically microseconds), so there is no meaningful throughput impact.
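A minimal sketch of the locking pattern (field names follow the PR description; bodies are illustrative):

```zig
const std = @import("std");

pub const EthLibp2p = struct {
    state_mutex: std.Thread.Mutex = .{},
    // ... networking state shared with the main thread ...
};

pub const BeamNode = struct {
    // Null in unit tests; points at &network.state_mutex in production.
    state_mutex: ?*std.Thread.Mutex = null,

    pub fn onInterval(self: *BeamNode) void {
        if (self.state_mutex) |m| m.lock();
        defer if (self.state_mutex) |m| m.unlock();
        // processCachedDescendants(), fork-choice walk, etc.
    }
};
```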

The long-term fix is to dispatch all network events through xev.Async onto the main libxev thread — which is what the existing scheduleOnLoop infrastructure was designed for but never fully wired up. The mutex is the correct immediate solution; xev.Async-based dispatch can be added as a follow-up.
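For reference, the follow-up dispatch could look roughly like this — a sketch against the libxev API as commonly used; the queue type and handler names are assumptions, not existing code:

```zig
const xev = @import("xev");

// Rust-thread side: push the event into a thread-safe queue, then wake the loop.
fn onNetworkEvent(node: *BeamNode, event: NetworkEvent) !void {
    try node.event_queue.push(event); // hypothetical MPSC queue
    try node.wakeup.notify(); // xev.Async: safe to call from any thread
}

// Main-thread side: drain the queue whenever the async handle fires.
fn onWakeup(
    node: ?*BeamNode,
    loop: *xev.Loop,
    c: *xev.Completion,
    result: xev.Async.WaitError!void,
) xev.CallbackAction {
    _ = loop;
    _ = c;
    result catch return .disarm;
    while (node.?.event_queue.pop()) |event| {
        node.?.processNetworkEvent(event); // runs only on the libxev thread
    }
    return .rearm;
}
```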

Test plan

  • zig build — clean build
  • zig build test --summary all — all tests pass
  • zig fmt --check on modified files — no formatting issues

When a block arrives with a missing parent, the old code immediately sent
an individual blocks_by_root request for that single parent root.  A syncing
peer walking a long parent chain (e.g. 300 slots back) would therefore open
300+ separate libp2p streams - one per ancestor - flooding both sides with
individual round-trips.

Replace the immediate fire-and-forget with a deferred queue:

- Add `pending_parent_roots: AutoHashMap(Root, depth)` to BeamNode.
- `cacheBlockAndFetchParent` now enqueues the missing parent root instead of
  calling `fetchBlockByRoots` directly.
- `flushPendingParentFetches` drains the map and issues a single batched
  blocks_by_root request for all accumulated roots.
- The flush is called at every natural exit point: after the missing-parent
  early-return in `onGossip`, at the end of `handleGossipProcessingResult`,
  and at the end of `processBlockByRootChunk`.

When multiple gossip blocks arrive in the same burst with the same missing
ancestor, all their parent roots are now collected and sent as one request
instead of N separate requests.
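A sketch of the enqueue/flush pair described above (types and field names follow the commit message; the RPC call is stubbed, error handling elided):

```zig
const std = @import("std");

const Root = [32]u8;

const BeamNode = struct {
    allocator: std.mem.Allocator,
    pending_parent_roots: std.AutoHashMap(Root, usize), // root -> depth

    fn enqueueParentFetch(self: *BeamNode, parent_root: Root, depth: usize) !void {
        // Deduplicates: a second block with the same missing ancestor
        // just overwrites the existing entry.
        try self.pending_parent_roots.put(parent_root, depth);
    }

    fn flushPendingParentFetches(self: *BeamNode) !void {
        if (self.pending_parent_roots.count() == 0) return;

        var roots = try std.ArrayList(Root).initCapacity(
            self.allocator,
            self.pending_parent_roots.count(),
        );
        defer roots.deinit(self.allocator);

        var max_depth: usize = 0;
        var it = self.pending_parent_roots.iterator();
        while (it.next()) |entry| {
            roots.appendAssumeCapacity(entry.key_ptr.*);
            max_depth = @max(max_depth, entry.value_ptr.*);
        }
        self.pending_parent_roots.clearRetainingCapacity();

        // One blocks_by_root request for the whole batch.
        try self.fetchBlockByRoots(roots.items, max_depth);
    }

    fn fetchBlockByRoots(self: *BeamNode, roots: []const Root, max_depth: usize) !void {
        _ = self;
        _ = roots;
        _ = max_depth; // real implementation issues the batched RPC
    }
};
```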
@zclawz (Contributor) commented:

Good optimization — clear problem, clean solution. A few observations:

1. roots.deinit(self.allocator) — likely a bug

std.ArrayList stores its allocator internally, so deinit() takes no arguments. This should be roots.deinit() (no parameter). With the extra argument this won't compile.

If you intended an unmanaged list, use std.ArrayListUnmanaged + deinit(allocator). But since self.allocator is already captured at initCapacity, plain ArrayList + deinit() is the right choice here.

2. max_depth across all batched roots

Using the maximum depth across all pending roots for the single batched request is conservative but correct — the worst case is we request a few more ancestors than strictly needed for shallower roots. Fine as a starting point; could be refined later if needed.

3. FetchFailedCachingFailed consolidation

Replacing the now-removed FetchFailed with CachingFailed (for the put allocation failure path) is reasonable — an allocation failure during enqueue is effectively a caching failure. Callers that handled FetchFailed may need updating if they had separate recovery logic, but since this is an internal error type it's unlikely to matter.

4. Flush coverage looks complete

The four flush sites (gossip early-return, handleGossipProcessingResult, processBlockByRootChunk end, and the RPC response handler) cover the natural exit points. Roots that accumulate on error paths will be flushed on the next message cycle, which is acceptable.

Fix the deinit call and this looks good to merge.

@ch4r10t33r (Author) commented Mar 25, 2026:

On point 1: roots.deinit(self.allocator) is intentional and correct for this project.

This project targets Zig 0.15, where std.ArrayList.deinit() was changed to require an explicit allocator argument:

// Zig 0.15 stdlib — std/array_list.zig
pub fn deinit(self: *Self, gpa: Allocator) void {

Without the argument it fails to compile with:

error: member function expected 1 argument(s), found 0

The same deinit(self.allocator) pattern is used consistently throughout this file (6 call sites, e.g. missing_roots.deinit(self.allocator) in fetchBlockByRoots a few lines below). You may be thinking of Zig ≤ 0.13, where ArrayList stored the allocator internally and deinit() took no arguments.
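For readers on older toolchains, the difference in a nutshell (a sketch; the managed variant is older-Zig style, per the discussion above):

```zig
const std = @import("std");

// Zig 0.15: the default std.ArrayList no longer stores its allocator,
// so every allocating/freeing call takes it explicitly.
fn zig015Style(allocator: std.mem.Allocator) !void {
    var roots = try std.ArrayList(u32).initCapacity(allocator, 8);
    defer roots.deinit(allocator); // allocator argument required
    try roots.append(allocator, 42);
}

// Older Zig (managed list): allocator captured at init, deinit() takes none.
//   var roots = std.ArrayList(u32).init(allocator);
//   defer roots.deinit();
```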

@ch4r10t33r ch4r10t33r marked this pull request as ready for review March 25, 2026 17:55
The Rust networking thread (spawned by create_and_run_network) calls the
export fn callbacks directly, while the main libxev thread runs onInterval.
Both access shared state (fetched_blocks cache, chain, fork-choice, etc.)
concurrently with no synchronisation, causing heap corruption that manifests
as an integer overflow in Bitlist.serializedSize when a corrupted length
field is passed to ssz.serialize.

Fix: add EthLibp2p.state_mutex and acquire it at the top of every export fn
callback.  The main-thread onInterval also acquires the same mutex via the
BeamNode.state_mutex pointer (null in unit-tests, set to
&EthLibp2p.state_mutex in production).  This serialises all access to shared
state between the two threads and eliminates the corruption.

The TODO comment about "scheduling on the loop" explains the original intent:
all network events should be dispatched on the libxev main thread rather than
directly on the Rust thread.  The mutex is a simpler intermediate fix that is
fully correct; loop-based dispatch can be added later using xev.Async.
@ch4r10t33r ch4r10t33r requested review from GrapeBaBa and zclawz March 26, 2026 08:44
@GrapeBaBa (Member) commented:

Can we keep the lock only on the shared-state access code path? The current approach doesn't look good.

@anshalshukla (Collaborator) commented:

I’m not in favor of locking the state. A better approach may be to lock only the DB for writes and buffer any new writes until later, while still serving network calls. Locking state in order to serve network requests introduces a large security vector. I don’t think even DB-level locking should be necessary for reads. Block sync operates on blocks after finalization, so that data should already be immutable.

Previously the lock was acquired at the top of every export fn, covering
frame parsing, snappy decompression, SSZ deserialisation, and logging —
none of which touch shared node state.

Move each lock acquisition to immediately before the first actual shared-
state access:
- handleMsgFromRustBridge: lock just before gossipHandler.onGossip()
- handleRPCRequestFromRustBridge: lock just before reqrespHandler.onReqRespRequest()
- handleRPCResponseFromRustBridge: lock just before rpcCallbacks.getPtr()
- handleRPCEndOfStreamFromRustBridge: lock just before rpcCallbacks.fetchRemove()
- handleRPCErrorFromRustBridge: lock just before rpcCallbacks.fetchRemove()
- peer handlers: lock just before peerEventHandler.on*() calls

Read-only setup work (std.mem.span, enum casts, node_registry lookups,
logging) now runs outside the lock, which is safe because:
- The GPA allocator is thread-safe in Zig 0.15 multi-threaded builds.
- node_registry is *const and immutable after init.
- Frame / snappy / SSZ decode creates new heap objects, not shared state.
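The narrowed pattern, sketched for the gossip handler (identifiers mirror the TODO snippet quoted earlier; the decode helper is hypothetical):

```zig
export fn handleMsgFromRustBridge(/* raw frame args */) void {
    // Outside the lock: frame parsing, snappy decompression and SSZ
    // deserialisation only create fresh heap objects.
    var message = decodeGossipFrame() catch return; // hypothetical helper

    zigHandler.network.state_mutex.lock();
    defer zigHandler.network.state_mutex.unlock();
    // Inside the lock: the only call that touches shared node state.
    zigHandler.gossipHandler.onGossip(&message, sender_peer_id_slice, false) catch |e| {
        std.log.err("onGossip failed: {any}", .{e});
    };
}
```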
@ch4r10t33r (Author) commented:

@GrapeBaBa Fixed in the latest commit. The lock is now acquired immediately before the first call that touches shared node state in each handler, and released right after:

  • handleMsgFromRustBridge — lock covers only gossipHandler.onGossip()
  • handleRPCRequestFromRustBridge — lock covers only reqrespHandler.onReqRespRequest()
  • handleRPCResponseFromRustBridge — lock covers only rpcCallbacks.getPtr() + callback_ptr.notify()
  • handleRPCEndOfStreamFromRustBridge / handleRPCErrorFromRustBridge — lock covers only rpcCallbacks.fetchRemove() + callback.notify()
  • Peer handlers — lock covers only the peerEventHandler.on*() call

The setup work (frame parsing, snappy decompression, SSZ deserialisation, span/enum casts, logging) all runs outside the lock. It is safe without the lock because the GPA allocator is thread-safe in Zig 0.15 multi-threaded builds, node_registry is *const and immutable after init, and everything else creates fresh heap objects with no shared-state reads or writes.

@ch4r10t33r (Author) commented:

@anshalshukla Thanks for the detailed feedback. A few points:

On "buffer writes and lock only the DB": That is exactly the xev.Async approach already mentioned in the PR description as the long-term fix. The Rust callback would enqueue an event into a lock-free channel, and the main libxev thread would dequeue and process it on its own tick, with no mutex at all. That is the right architectural solution. This PR is the minimal safe patch for an active crash; the proper redesign is a follow-up once the immediate regression is resolved.

On "block sync operates on finalized, immutable data": The race is not on finalized state. The crash happens in fetched_blocks, which is an in-flight cache of pre-finalized blocks that are still being validated by the fork-choice and chain modules. These blocks are written by the Rust thread (when an RPC block-by-root response arrives) and read concurrently by the main libxev thread (during the interval tick that processes cached descendants). That shared mutable structure is exactly what the mutex serialises.

On "DB-level locking": The DB is fine; RocksDB is thread-safe. The issue is the in-memory Zig structures (fetched_blocks HashMap, the fork-choice tree, etc.) that are not thread-safe on their own.

On "security vector": Happy to understand this concern better if you can expand on it. A mutex that serialises two threads accessing the same in-memory data structures is a standard correctness tool, not a security concern as far as I can see. The critical sections are narrow (we narrowed them further per GrapeBaBa's comment), so latency impact is minimal.

