Skip to content

fix: skip pre-finalized attestations instead of aborting block import#692

Merged
ch4r10t33r merged 5 commits intomainfrom
fix/checkpoint-sync-chunked-body
Mar 25, 2026
Merged

fix: skip pre-finalized attestations instead of aborting block import#692
ch4r10t33r merged 5 commits intomainfrom
fix/checkpoint-sync-chunked-body

Conversation

@ch4r10t33r
Copy link
Copy Markdown
Contributor

@ch4r10t33r ch4r10t33r commented Mar 24, 2026

Problem

Two bugs prevented a checkpoint-synced node from ever catching up to the network head.

Bug 1: Skip pre-finalized attestations (already merged in previous commit)

process_attestations called try utils.IsJustifiableSlot(...), which returned StateTransitionError.InvalidJustifiableSlot for attestations referencing slots that are legal but fall outside a justifiable range. This aborted the entire block import. Fix: use catch false so invalid attestations are skipped rather than failing the block.

Bug 2: fc_initing deadlock (this PR)

After checkpoint sync the forkchoice starts in the .initing state and waits for a first justified checkpoint before declaring itself ready. The status-response sync handler was calling getSyncStatus() and treating fc_initing the same as .synced — doing nothing:

.synced, .no_peers, .fc_initing => {},

This created a hard deadlock:

  • The node needed blocks to exit fc_initing.
  • The sync code refused to fetch blocks while in fc_initing.

Fix

Status-response handler (node.zig): add an explicit fc_initing branch that requests the peer's head block when the peer's head slot is ahead of our anchor. This mirrors the behind_peers branch and breaks the deadlock.

Periodic sync refresh (node.zig, constants.zig): every SYNC_STATUS_REFRESH_INTERVAL_SLOTS (8 slots ≈ 32 s), re-send our status to all connected peers when in fc_initing or behind_peers. This recovers from the case where all peers were already connected before the fix was deployed — without this, no new connection event fires and the status-response handler is never re-triggered for existing peers.

Test plan

  • Deploy image, checkpoint-sync to a non-genesis slot, verify Head Slot advances past the anchor within ~30 s of startup
  • Verify the log line "peer … is ahead during fc init … requesting head block" appears in the logs
  • Verify forkchoice transitions from initing to ready and validator duties activate

ch4r10t33r and others added 3 commits March 23, 2026 15:25
When the SSZ state grows beyond ~3 MB the server switches from sending
a Content-Length response to Transfer-Encoding: chunked. The previous
body-reading loop called readSliceShort which internally goes through:

  readSliceShort → readVec → defaultReadVec → contentLengthStream

contentLengthStream accesses reader.state.body_remaining_content_length
but that field is not active for chunked responses (state is 'ready'),
causing a panic:

  thread 1 panic: access of union field 'body_remaining_content_length'
  while field 'ready' is active

Replace the manual request/response loop with client.fetch() using a
std.Io.Writer.Allocating as the response_writer. fetch() calls
response.readerDecompressing() + streamRemaining() which dispatches
through chunkedStream or contentLengthStream correctly based on the
actual transfer encoding used by the server.
After checkpoint sync, incoming blocks may carry attestations whose target
slots predate the finalized anchor. IsJustifiableSlot correctly identifies
these as non-justifiable, but the `try` was propagating the error fatally,
causing the entire block import to fail.

This creates a cascading gap: block N fails → blocks N+1..M fail (missing
parent) → no epoch-boundary attestations accumulate → justified checkpoint
never advances → forkchoice stays stuck in `initing` indefinitely.

Fix: catch InvalidJustifiableSlot and treat it as `false`. The attestation
is then silently skipped via the existing !is_target_justifiable check,
exactly as all other non-viable attestations (unknown source/target/head,
stale slot, etc.) are handled. The block imports successfully, the chain
catches up, and the node exits the initing state.

Update the test that was asserting the old (buggy) error-propagation
behaviour to instead assert that process_attestations succeeds.
@ch4r10t33r ch4r10t33r marked this pull request as ready for review March 24, 2026 21:38
@ch4r10t33r ch4r10t33r changed the title types: skip pre-finalized attestations instead of aborting block import fix: skip pre-finalized attestations instead of aborting block import Mar 24, 2026
After checkpoint sync the forkchoice starts in the initing state and
waits for a first justified checkpoint before declaring itself ready.
The status-response sync handler was checking getSyncStatus() and
treating fc_initing the same as synced — doing nothing.  This created
a deadlock: the node never requested blocks from ahead peers because
it was in fc_initing, and it could never leave fc_initing because no
blocks were imported.

Fix the deadlock in two places:

1. Status-response handler: add an explicit fc_initing branch that
   requests the peer's head block when the peer is ahead of our anchor
   slot.  This mirrors the behind_peers branch but uses head_slot for
   the comparison (finalized_slot is not yet meaningful in fc_initing).

2. Periodic sync refresh: every SYNC_STATUS_REFRESH_INTERVAL_SLOTS (8)
   slots, re-send our status to all connected peers when not synced.
   This recovers from the case where all peers were already connected
   before the fix was deployed, so no new connection event fires and
   the status-response handler would never be re-triggered.
@ch4r10t33r ch4r10t33r merged commit 437256b into main Mar 25, 2026
13 of 22 checks passed
@ch4r10t33r ch4r10t33r deleted the fix/checkpoint-sync-chunked-body branch March 25, 2026 08:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants