Description
Background
PR #695 introduced EthLibp2p.state_mutex as a minimal safe patch to fix a race condition (heap corruption) between the Rust networking thread and the main libxev event-loop thread. It works correctly and resolves the immediate crash.
However, the PR description itself acknowledges this is an intermediate fix:
On "buffer writes and lock only the DB": That is exactly the xev.Async approach already mentioned in the PR description as the long-term fix. The Rust callback would enqueue an event into a lock-free channel, and the main libxev thread would dequeue and process it on its own tick, with no mutex at all. That is the right architectural solution. This PR is the minimal safe patch for an active crash; the proper redesign is a follow-up once the immediate regression is resolved.
Proposed Architecture
Replace the mutex-based synchronisation with an xev.Async-based event dispatch model:
- Rust networking callbacks enqueue events into a lock-free channel (e.g. std.atomic.Queue or a ring buffer) instead of directly calling into shared Zig state
- The main libxev event-loop thread dequeues and processes these events on its own tick via xev.Async
- No mutex is needed: all shared state is only ever touched from the single libxev thread
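The dispatch model above can be illustrated with a minimal, language-neutral sketch. It is written in Rust with only the standard library and does not use libxev or the project's real types: `NetEvent`, `ChainState`, `on_network_event`, and `spawn_loop` are hypothetical names, `std::sync::mpsc` stands in for the lock-free channel, and the blocking receive stands in for the xev.Async wakeup. The point it shows is that the callback only enqueues, and the state is owned by exactly one thread, so no mutex wraps it.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Hypothetical event type: what a networking callback would report.
#[derive(Debug, Clone, PartialEq)]
pub enum NetEvent {
    Block { slot: u64 },
    Attestation { slot: u64, validator: u64 },
}

/// Hypothetical state owned exclusively by the loop thread (no Mutex around it).
#[derive(Debug, Default, PartialEq)]
pub struct ChainState {
    pub head_slot: u64,
    pub attestations_seen: u64,
}

impl ChainState {
    /// Applies one event; only ever called from the loop thread.
    fn apply(&mut self, ev: NetEvent) {
        match ev {
            NetEvent::Block { slot } => self.head_slot = self.head_slot.max(slot),
            NetEvent::Attestation { .. } => self.attestations_seen += 1,
        }
    }
}

/// Stand-in for a networking callback: it enqueues the event and returns
/// without touching ChainState.
pub fn on_network_event(tx: &Sender<NetEvent>, ev: NetEvent) {
    // A send error means the loop has shut down; this sketch drops the event.
    let _ = tx.send(ev);
}

/// Spawns the "loop" thread and returns the sender used by callbacks plus a
/// handle that yields the final state once every sender has been dropped.
pub fn spawn_loop() -> (Sender<NetEvent>, thread::JoinHandle<ChainState>) {
    let (tx, rx) = mpsc::channel::<NetEvent>();
    let handle = thread::spawn(move || {
        let mut state = ChainState::default();
        // Iterating the receiver blocks until an event arrives, which plays
        // the role of the async wakeup; it ends when all senders are dropped.
        for ev in rx {
            state.apply(ev);
        }
        state
    });
    (tx, handle)
}
```

In the real design the consumer would not be a dedicated blocking thread: the drain step would run inside the existing libxev loop when the async watcher fires, alongside the loop's other I/O.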
Benefits
- Eliminates lock contention between the Rust thread and the Zig event loop
- Cleaner architectural boundary between networking (Rust) and consensus logic (Zig)
- Better scalability — no blocking on either side
- Aligns with the libxev/async-IO design philosophy
Acceptance Criteria
- Rust export fn callbacks enqueue events rather than directly accessing BeamNode / chain state
- An xev.Async watcher is registered on the main libxev loop to drain the event queue
- EthLibp2p.state_mutex and BeamNode.state_mutex are removed
- Existing behaviour (block processing, fork choice, attestation handling) is preserved
- Unit tests pass without the mutex
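One way to check the "behaviour is preserved without the mutex" criteria is a stress-style test that pushes events from several producer threads through the queue and asserts that a single consumer sees every event, with each producer's events in the order they were sent. The sketch below is an assumption about what such a test could look like, again in standard-library Rust rather than against the real Zig code; `run_dispatch` and `DispatchReport` are hypothetical names, and `std::sync::mpsc` stands in for the event queue.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical summary of one dispatch run.
#[derive(Debug, PartialEq)]
pub struct DispatchReport {
    /// Total number of events the single consumer handled.
    pub processed: usize,
    /// True if every producer's events arrived in the order they were sent.
    pub per_producer_in_order: bool,
}

/// Starts `producers` threads that each enqueue `per_producer` sequenced
/// events, drains them on one consumer thread, and reports what it observed.
pub fn run_dispatch(producers: usize, per_producer: usize) -> DispatchReport {
    // Each event is (producer index, sequence number within that producer).
    let (tx, rx) = mpsc::channel::<(usize, usize)>();

    let consumer = thread::spawn(move || {
        let mut processed = 0usize;
        let mut in_order = true;
        // Next sequence number expected from each producer.
        let mut next_seq = vec![0usize; producers];
        for (producer, seq) in rx {
            if seq != next_seq[producer] {
                in_order = false;
            }
            next_seq[producer] = seq + 1;
            processed += 1;
        }
        DispatchReport { processed, per_producer_in_order: in_order }
    });

    let workers: Vec<_> = (0..producers)
        .map(|p| {
            let tx = tx.clone();
            thread::spawn(move || {
                for seq in 0..per_producer {
                    tx.send((p, seq)).expect("consumer thread should be alive");
                }
            })
        })
        .collect();

    // Drop the original sender so the consumer loop ends once workers finish.
    drop(tx);
    for w in workers {
        w.join().unwrap();
    }
    consumer.join().unwrap()
}
```

A test built this way exercises the interleaving that previously required the mutex, while the assertions stay deterministic because they depend only on totals and per-producer ordering, not on how the threads happen to be scheduled.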
References
- PR #695, "fix: batch pending parent root fetches to avoid 300+ sequential round-trips" (introduced the mutex as interim fix)
- Issue #699, "Panic on checkpoint-sync restart: 'no reactor running' in delay_map (Tokio runtime context missing)" (original crash that triggered #695)