fix: batch pending parent root fetches to avoid 300+ sequential round-trips#695

Open
ch4r10t33r wants to merge 5 commits into main from fix/batch-parent-fetches

Conversation

@ch4r10t33r (Contributor) commented Mar 25, 2026:

Summary

  • Adds EthLibp2p.state_mutex and locks it at the top of every export fn callback called from the Rust networking thread
  • Adds BeamNode.state_mutex (a pointer to the same mutex, null in unit-tests) and locks it at the top of onInterval on the main libxev thread
  • Batches pending parent-root fetches into a single blocks_by_root request to avoid sequential round-trips
  • Fixes a missed flushPendingParentFetches in an early-return path of processBlockByRootChunk

Root cause of the integer overflow crash

The node was crashing with:

thread 1 panic: integer overflow
ssz/utils.zig  Bitlist.serializedSize
ssz/lib.zig    serialize
rocksdb.zig    putToBatch
node.zig       processCachedDescendants
node.zig       onInterval
io_uring.zig   run   ← main libxev thread

Two threads, one shared state, no synchronisation

create_and_run_network spawns a Rust/Tokio thread. Every network event — gossip messages, RPC responses, peer connect/disconnect — fires one of the export fn handlers in ethlibp2p.zig on that Rust thread. Those handlers call directly into Zig node state: onGossip, processBlockByRootChunk, cacheBlockAndFetchParent, chain.onBlock, etc.

At the same time, the main libxev thread runs onInterval on a timer and calls processCachedDescendants, which reads from the same fetched_blocks HashMap and walks the same chain / fork-choice state.

There was already a TODO comment in the code documenting the original intent to schedule these callbacks on the libxev loop instead:

// TODO: figure out why scheduling on the loop is not working
zigHandler.gossipHandler.onGossip(&message, sender_peer_id_slice, false) catch |e| { ... };

How the corruption produces an integer overflow

sszClone (used when caching a fetched block) deserialises into a freshly-allocated SignedBlockWithAttestation. It writes struct fields sequentially — Bitlist.length first, then the inner ArrayList. If the main thread observes the struct between those two stores — or if the fetched_blocks HashMap is modified while being iterated on the main thread — it reads a torn state:

  • length has been written (or contains 0xAAAAAAAAAAAAAAAA from GPA debug poisoning of a concurrently freed slot)
  • inner is not yet consistent

That torn length then flows into:

// Bitlist.serializedSize in ssz/utils.zig
pub fn serializedSize(self: *const Self) usize {
    return (self.length + 7 + 1) / 8;  // overflows when length ≈ usize::MAX
}

serializedSize returns usize (not !usize), so Zig cannot propagate an error — it panics at the arithmetic.
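The failure mode can be reproduced in isolation. The following is a hypothetical minimal repro (not code from this repo) that feeds a torn length through the same expression; wrapping addition is used so the test can observe the wrap instead of panicking:

```zig
const std = @import("std");

test "torn Bitlist length overflows serializedSize arithmetic" {
    // A torn read can yield a length near usize::MAX.
    const torn_length: usize = std.math.maxInt(usize) - 3;
    // Same expression as Bitlist.serializedSize: (length + 7 + 1) / 8.
    // In Debug/ReleaseSafe builds, `torn_length + 8` panics with
    // "integer overflow" — exactly the crash in the stack trace above.
    const wrapped = (torn_length +% 8) / 8;
    try std.testing.expectEqual(@as(usize, 0), wrapped); // wrapped past zero
}
```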

The fix

Add std.Thread.Mutex state_mutex to EthLibp2p. Every export fn acquires it before touching any Zig state. BeamNode.onInterval acquires the same mutex via a pointer stored in BeamNode.state_mutex (set to &network.state_mutex in production, null in unit-tests so test setup needs no changes).

This serialises all shared-state access between the two threads and eliminates the corruption. The Rust thread holds the lock only for the duration of processing one network event (typically microseconds), so there is no meaningful throughput impact.
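A minimal sketch of the locking pattern (field names follow the PR description; bodies are illustrative):

```zig
const std = @import("std");

pub const EthLibp2p = struct {
    state_mutex: std.Thread.Mutex = .{},
    // ... networking state shared with the main thread ...
};

pub const BeamNode = struct {
    // Null in unit tests; points at &network.state_mutex in production.
    state_mutex: ?*std.Thread.Mutex = null,

    pub fn onInterval(self: *BeamNode) void {
        if (self.state_mutex) |m| m.lock();
        defer if (self.state_mutex) |m| m.unlock();
        // processCachedDescendants(), fork-choice walk, etc.
    }
};
```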

The long-term fix is to dispatch all network events through xev.Async onto the main libxev thread — which is what the existing scheduleOnLoop infrastructure was designed for but never fully wired up. The mutex is the correct immediate solution; xev.Async-based dispatch can be added as a follow-up.
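For reference, the follow-up dispatch could look roughly like this — a sketch against the libxev API as commonly used; the queue type and handler names are assumptions, not existing code:

```zig
const xev = @import("xev");

// Rust-thread side: push the event into a thread-safe queue, then wake the loop.
fn onNetworkEvent(node: *BeamNode, event: NetworkEvent) !void {
    try node.event_queue.push(event); // hypothetical MPSC queue
    try node.wakeup.notify(); // xev.Async: safe to call from any thread
}

// Main-thread side: drain the queue whenever the async handle fires.
fn onWakeup(
    node: ?*BeamNode,
    loop: *xev.Loop,
    c: *xev.Completion,
    result: xev.Async.WaitError!void,
) xev.CallbackAction {
    _ = loop;
    _ = c;
    result catch return .disarm;
    while (node.?.event_queue.pop()) |event| {
        node.?.processNetworkEvent(event); // runs only on the libxev thread
    }
    return .rearm;
}
```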

Test plan

  • zig build — clean build
  • zig build test --summary all — all tests pass
  • zig fmt --check on modified files — no formatting issues

When a block arrives with a missing parent, the old code immediately sent
an individual blocks_by_root request for that single parent root.  A syncing
peer walking a long parent chain (e.g. 300 slots back) would therefore open
300+ separate libp2p streams - one per ancestor - flooding both sides with
individual round-trips.

Replace the immediate fire-and-forget with a deferred queue:

- Add `pending_parent_roots: AutoHashMap(Root, depth)` to BeamNode.
- `cacheBlockAndFetchParent` now enqueues the missing parent root instead of
  calling `fetchBlockByRoots` directly.
- `flushPendingParentFetches` drains the map and issues a single batched
  blocks_by_root request for all accumulated roots.
- The flush is called at every natural exit point: after the missing-parent
  early-return in `onGossip`, at the end of `handleGossipProcessingResult`,
  and at the end of `processBlockByRootChunk`.

When multiple gossip blocks arrive in the same burst with the same missing
ancestor, all their parent roots are now collected and sent as one request
instead of N separate requests.
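A sketch of the enqueue/flush pair described above (types and field names follow the commit message; the RPC call is stubbed, error handling elided):

```zig
const std = @import("std");

const Root = [32]u8;

const BeamNode = struct {
    allocator: std.mem.Allocator,
    pending_parent_roots: std.AutoHashMap(Root, usize), // root -> depth

    fn enqueueParentFetch(self: *BeamNode, parent_root: Root, depth: usize) !void {
        // Deduplicates: a second block with the same missing ancestor
        // just overwrites the existing entry.
        try self.pending_parent_roots.put(parent_root, depth);
    }

    fn flushPendingParentFetches(self: *BeamNode) !void {
        if (self.pending_parent_roots.count() == 0) return;

        var roots = try std.ArrayList(Root).initCapacity(
            self.allocator,
            self.pending_parent_roots.count(),
        );
        defer roots.deinit(self.allocator);

        var max_depth: usize = 0;
        var it = self.pending_parent_roots.iterator();
        while (it.next()) |entry| {
            roots.appendAssumeCapacity(entry.key_ptr.*);
            max_depth = @max(max_depth, entry.value_ptr.*);
        }
        self.pending_parent_roots.clearRetainingCapacity();

        // One blocks_by_root request for the whole batch.
        try self.fetchBlockByRoots(roots.items, max_depth);
    }

    fn fetchBlockByRoots(self: *BeamNode, roots: []const Root, max_depth: usize) !void {
        _ = self;
        _ = roots;
        _ = max_depth; // real implementation issues the batched RPC
    }
};
```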
@zclawz (Contributor) commented:

Good optimization — clear problem, clean solution. A few observations:

1. roots.deinit(self.allocator) — likely a bug

std.ArrayList stores its allocator internally, so deinit() takes no arguments. This should be roots.deinit() (no parameter). With the extra argument this won't compile.

If you intended an unmanaged list, use std.ArrayListUnmanaged + deinit(allocator). But since self.allocator is already captured at initCapacity, plain ArrayList + deinit() is the right choice here.

2. max_depth across all batched roots

Using the maximum depth across all pending roots for the single batched request is conservative but correct — the worst case is we request a few more ancestors than strictly needed for shallower roots. Fine as a starting point; could be refined later if needed.

3. FetchFailedCachingFailed consolidation

Replacing the now-removed FetchFailed with CachingFailed (for the put allocation failure path) is reasonable — an allocation failure during enqueue is effectively a caching failure. Callers that handled FetchFailed may need updating if they had separate recovery logic, but since this is an internal error type it's unlikely to matter.

4. Flush coverage looks complete

The four flush sites (gossip early-return, handleGossipProcessingResult, processBlockByRootChunk end, and the RPC response handler) cover the natural exit points. Roots that accumulate on error paths will be flushed on the next message cycle, which is acceptable.

Fix the deinit call and this looks good to merge.

@ch4r10t33r (Author) commented Mar 25, 2026:

On point 1: roots.deinit(self.allocator) is intentional and correct for this project.

This project targets Zig 0.15, where std.ArrayList.deinit() was changed to require an explicit allocator argument:

// Zig 0.15 stdlib — std/array_list.zig
pub fn deinit(self: *Self, gpa: Allocator) void {

Without the argument it fails to compile with:

error: member function expected 1 argument(s), found 0

The same deinit(self.allocator) pattern is used consistently throughout this file (6 call sites, e.g. missing_roots.deinit(self.allocator) in fetchBlockByRoots a few lines below). You may be thinking of Zig ≤ 0.13, where ArrayList stored the allocator internally and deinit() took no arguments.
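For readers on older toolchains, the difference in a nutshell (a sketch; the managed variant is older-Zig style, per the discussion above):

```zig
const std = @import("std");

// Zig 0.15: the default std.ArrayList no longer stores its allocator,
// so every allocating/freeing call takes it explicitly.
fn zig015Style(allocator: std.mem.Allocator) !void {
    var roots = try std.ArrayList(u32).initCapacity(allocator, 8);
    defer roots.deinit(allocator); // allocator argument required
    try roots.append(allocator, 42);
}

// Older Zig (managed list): allocator captured at init, deinit() takes none.
//   var roots = std.ArrayList(u32).init(allocator);
//   defer roots.deinit();
```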

@ch4r10t33r ch4r10t33r marked this pull request as ready for review March 25, 2026 17:55
The Rust networking thread (spawned by create_and_run_network) calls the
export fn callbacks directly, while the main libxev thread runs onInterval.
Both access shared state (fetched_blocks cache, chain, fork-choice, etc.)
concurrently with no synchronisation, causing heap corruption that manifests
as an integer overflow in Bitlist.serializedSize when a corrupted length
field is passed to ssz.serialize.

Fix: add EthLibp2p.state_mutex and acquire it at the top of every export fn
callback.  The main-thread onInterval also acquires the same mutex via the
BeamNode.state_mutex pointer (null in unit-tests, set to
&EthLibp2p.state_mutex in production).  This serialises all access to shared
state between the two threads and eliminates the corruption.

The TODO comment about "scheduling on the loop" explains the original intent:
all network events should be dispatched on the libxev main thread rather than
directly on the Rust thread.  The mutex is a simpler intermediate fix that is
fully correct; loop-based dispatch can be added later using xev.Async.
@ch4r10t33r ch4r10t33r requested review from GrapeBaBa and zclawz March 26, 2026 08:44
@GrapeBaBa (Member) commented:

Can we keep the lock only on the shared-state access code path? The current approach doesn't look good.

@anshalshukla (Collaborator) commented:

I’m not in favor of locking the state. A better approach may be to lock only the DB for writes and buffer any new writes until later, while still serving network calls. Locking state in order to serve network requests introduces a large security vector. I don’t think even DB-level locking should be necessary for reads. Block sync operates on blocks after finalization, so that data should already be immutable.

Previously the lock was acquired at the top of every export fn, covering
frame parsing, snappy decompression, SSZ deserialisation, and logging —
none of which touch shared node state.

Move each lock acquisition to immediately before the first actual shared-
state access:
- handleMsgFromRustBridge: lock just before gossipHandler.onGossip()
- handleRPCRequestFromRustBridge: lock just before reqrespHandler.onReqRespRequest()
- handleRPCResponseFromRustBridge: lock just before rpcCallbacks.getPtr()
- handleRPCEndOfStreamFromRustBridge: lock just before rpcCallbacks.fetchRemove()
- handleRPCErrorFromRustBridge: lock just before rpcCallbacks.fetchRemove()
- peer handlers: lock just before peerEventHandler.on*() calls

Read-only setup work (std.mem.span, enum casts, node_registry lookups,
logging) now runs outside the lock, which is safe because:
- The GPA allocator is thread-safe in Zig 0.15 multi-threaded builds.
- node_registry is *const and immutable after init.
- Frame / snappy / SSZ decode creates new heap objects, not shared state.
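The narrowed pattern, sketched for the gossip handler (identifiers mirror the TODO snippet quoted earlier; the decode helper is hypothetical):

```zig
export fn handleMsgFromRustBridge(/* raw frame args */) void {
    // Outside the lock: frame parsing, snappy decompression and SSZ
    // deserialisation only create fresh heap objects.
    var message = decodeGossipFrame() catch return; // hypothetical helper

    zigHandler.network.state_mutex.lock();
    defer zigHandler.network.state_mutex.unlock();
    // Inside the lock: the only call that touches shared node state.
    zigHandler.gossipHandler.onGossip(&message, sender_peer_id_slice, false) catch |e| {
        std.log.err("onGossip failed: {any}", .{e});
    };
}
```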
@ch4r10t33r (Author) commented:

@GrapeBaBa Fixed in the latest commit. The lock is now acquired immediately before the first call that touches shared node state in each handler, and released right after:

  • handleMsgFromRustBridge — lock covers only gossipHandler.onGossip()
  • handleRPCRequestFromRustBridge — lock covers only reqrespHandler.onReqRespRequest()
  • handleRPCResponseFromRustBridge — lock covers only rpcCallbacks.getPtr() + callback_ptr.notify()
  • handleRPCEndOfStreamFromRustBridge / handleRPCErrorFromRustBridge — lock covers only rpcCallbacks.fetchRemove() + callback.notify()
  • Peer handlers — lock covers only the peerEventHandler.on*() call

The setup work (frame parsing, snappy decompression, SSZ deserialisation, span/enum casts, logging) all runs outside the lock. It is safe without the lock because the GPA allocator is thread-safe in Zig 0.15 multi-threaded builds, node_registry is *const and immutable after init, and everything else creates fresh heap objects with no shared-state reads or writes.

@ch4r10t33r (Author) commented:

@anshalshukla Thanks for the detailed feedback. A few points:

On "buffer writes and lock only the DB": That is exactly the xev.Async approach already mentioned in the PR description as the long-term fix. The Rust callback would enqueue an event into a lock-free channel, and the main libxev thread would dequeue and process it on its own tick, with no mutex at all. That is the right architectural solution. This PR is the minimal safe patch for an active crash; the proper redesign is a follow-up once the immediate regression is resolved.

On "block sync operates on finalized, immutable data": The race is not on finalized state. The crash happens in fetched_blocks, which is an in-flight cache of pre-finalized blocks that are still being validated by the fork-choice and chain modules. These blocks are written by the Rust thread (when an RPC block-by-root response arrives) and read concurrently by the main libxev thread (during the interval tick that processes cached descendants). That shared mutable structure is exactly what the mutex serialises.

On "DB-level locking": The DB is fine; RocksDB is thread-safe. The issue is the in-memory Zig structures (fetched_blocks HashMap, the fork-choice tree, etc.) that are not thread-safe on their own.

On "security vector": Happy to understand this concern better if you can expand on it. A mutex that serialises two threads accessing the same in-memory data structures is a standard correctness tool, not a security concern as far as I can see. The critical sections are narrow (we narrowed them further per GrapeBaBa's comment), so latency impact is minimal.

