Skip to content

fix(forkchoice): release mutex during aggregation proof building to prevent consensus stall#196

Merged
dimka90 merged 1 commit intomainfrom
fix/aggregation-mutex-contention
Apr 2, 2026
Merged

fix(forkchoice): release mutex during aggregation proof building to prevent consensus stall#196
dimka90 merged 1 commit intomainfrom
fix/aggregation-mutex-contention

Conversation

@dimka90
Copy link
Copy Markdown
Collaborator

@dimka90 dimka90 commented Apr 1, 2026

Summary

  • Split AggregateCommitteeSignatures() into three lock/unlock/lock phases so the expensive leanmultisig.Aggregate() FFI call (500ms–11s) runs without holding the fork-choice mutex
  • Consensus operations (block processing, attestation handling, time advances) continue during proof building
  • Block production (produce.go) retains the synchronous locked path via buildAggregatedAttestationsFromSignedLocked() since proofs are needed before publishing

Context

When gean runs as aggregator, AggregateCommitteeSignatures() held the global fork-choice mutex for the entire duration of proof building. With 6 participants this took 11+ seconds, blocking all consensus operations and causing gean to miss 2–3 slots per aggregation cycle. The effect snowballed — missed slots accumulated more attestations, making the next aggregation even slower.

Confirmed by devnet-3 logs and docker stats: gean used 322% CPU as aggregator while non-aggregator clients used under 2%.

After this fix, gean still syncs ~1 block occasionally when aggregation takes 3+ seconds (the proof time is unchanged — that's the upstream leanmultisig library cost), but it no longer falls behind or misses multiple slots. Without --is-aggregator, gean does not sync at all.

Pattern adopted from zeam's separation of signatures_mutex from the forkchoice lock (forkchoice.zig:308).


Test Plan

  • go build ./... compiles cleanly
  • go vet ./... passes
  • go test ./chain/forkchoice/... -count=1 — forkchoice tests pass
  • go test -race ./chain/forkchoice/... — no data races
  • Local 3-node devnet with --is-aggregator — gean stays at head, no multi-slot sync gaps
  • Verify aggregation complete duration unchanged (proof time is not reduced, only unblocked)
  • Verify behind_peers=false after initial catch-up
  • Verify gean can propose blocks while running as aggregator
    Closes fix(forkchoice): release mutex during aggregation proof building to prevent consensus stall #195

@dimka90 dimka90 requested a review from devylongs April 1, 2026 23:33
@dimka90 dimka90 requested review from morelucks and shaaibu7 April 2, 2026 09:13
AggregateCommitteeSignatures() held the global fork-choice mutex for the
entire duration of leanmultisig.Aggregate() FFI calls (500ms-11s). This
blocked all consensus operations — block processing, attestation handling,
and time advances — causing gean to fall behind and enter a sync loop.

Split into three phases:
1. Lock: collect inputs (pubkeys, signatures, data) from gossip cache
2. Unlock: run expensive leanmultisig.Aggregate() FFI calls
3. Lock: store resulting proofs back into cache

Block production (produce.go) retains the synchronous locked path via
buildAggregatedAttestationsFromSignedLocked() since it needs proofs
before publishing the block.

Matches zeam's pattern of separating signatures_mutex from the forkchoice
lock (forkchoice.zig:308).
@dimka90 dimka90 force-pushed the fix/aggregation-mutex-contention branch from 31bc702 to 6da5b96 Compare April 2, 2026 10:01
@dimka90 dimka90 merged commit 5cb2406 into main Apr 2, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(forkchoice): release mutex during aggregation proof building to prevent consensus stall

3 participants