node: finalization stalls due to misalignment between target selection and can_target_finalize condition #702

@ch4r10t33r

Problem Statement

On the Ansible-managed long-running devnet, the chain head advances normally — blocks are proposed and processed every slot — and latest_justified progresses (justification events fire), but latest_finalized never advances. Finalization is permanently stalled regardless of how long the devnet runs.

The symptom is asymmetric: attestation collection, block production, and justification all appear healthy; only finalization is stuck.

Root Cause Analysis

Cause 1 — Target selection is structurally misaligned with the finalization condition (primary)

The finalization condition in process_attestations (state.zig lines 492–502) requires that target is the immediately next justifiable slot after source, with no other justifiable slots between them:

// source is finalized if target is the next valid justifiable hash
var can_target_finalize = true;
const start_slot_usize: usize = @intCast(source_slot + 1);
const end_slot_usize: usize = @intCast(target_slot);
for (start_slot_usize..end_slot_usize) |slot_usize| {
    const slot: Slot = @intCast(slot_usize);
    if (try utils.IsJustifiableSlot(self.latest_finalized.slot, slot)) {
        can_target_finalize = false;
        break;
    }
}

Meanwhile, getAttestationTargetUnlocked (forkchoice.zig) selects the target by walking backward from the current head to find the most recent justifiable slot:

// walk at most 3 steps back from head toward safeTarget
for (0..3) |_| {
    if (nodes[target_idx].slot > self.safeTarget.slot) {
        target_idx = nodes[target_idx].parent orelse ...;
    }
}
// then find the first justifiable slot working backward
while (!try types.IsJustifiableSlot(self.fcStore.latest_finalized.slot, nodes[target_idx].slot)) {
    target_idx = nodes[target_idx].parent orelse ...;
}

These two goals are in conflict. Target selection finds the most recent justifiable slot anchored to the current head. The finalization condition requires the immediately next justifiable slot after source. On a long-running devnet where latest_finalized has fallen behind by many slots, the target selected by getAttestationTargetUnlocked is many justifiable steps ahead of source. The justifiable slot schedule from IsJustifiableSlot is:

Justifiable deltas from latest_finalized: 1, 2, 3, 4, 5, 6, 9, 12, 16, 20, 25, 30, 36, 42, 49, 56, 64, 72, 81, 90...
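
The listed deltas are consistent with a simple rule: every delta up to 5, every perfect square, and every pronic number n(n+1). The sketch below is inferred from that list only and is not copied from the repo. `isqrt` and `isJustifiableSlot` are hypothetical stand-ins, and unlike the real IsJustifiableSlot (which is called with `try`) this version returns a plain bool.

```zig
const std = @import("std");

/// Integer square root by linear search; adequate for the small deltas here.
fn isqrt(n: u64) u64 {
    var r: u64 = 0;
    while ((r + 1) * (r + 1) <= n) : (r += 1) {}
    return r;
}

/// Hypothetical stand-in for IsJustifiableSlot, inferred from the listed deltas.
fn isJustifiableSlot(finalized_slot: u64, candidate: u64) bool {
    if (candidate < finalized_slot) return false;
    const delta = candidate - finalized_slot;
    if (delta <= 5) return true;
    const r = isqrt(delta);
    if (r * r == delta) return true; // perfect squares: 9, 16, 25, ...
    return r * (r + 1) == delta; // pronic numbers: 6, 12, 20, 30, ...
}

test "deltas 1..90 reproduce the schedule listed above" {
    const expected = [_]u64{ 1, 2, 3, 4, 5, 6, 9, 12, 16, 20, 25, 30, 36, 42, 49, 56, 64, 72, 81, 90 };
    var found: [90]u64 = undefined;
    var n: usize = 0;
    var delta: u64 = 1;
    while (delta <= 90) : (delta += 1) {
        if (isJustifiableSlot(0, delta)) {
            found[n] = delta;
            n += 1;
        }
    }
    try std.testing.expectEqualSlices(u64, &expected, found[0..n]);
}
```

The gap between consecutive justifiable slots grows with the distance from latest_finalized, which is why a stale latest_finalized makes the schedule sparse near the head.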

If source = latest_finalized + 9 and target = latest_finalized + 90, the justifiable slots at deltas 12, 16, 20, 25, ... all lie between them, so can_target_finalize is false. Source is not finalized even though target has just been justified.

For can_target_finalize to be true, target must be exactly one justifiable step ahead of source — but the target selection never guarantees this.
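
The example above can be checked mechanically. This is a minimal sketch, not the code from state.zig: `justifiable_deltas` is copied from the schedule listed in this issue (so it only covers deltas up to 90), and `isListedJustifiable` and `canTargetFinalize` are hypothetical names that mirror the shape of the can_target_finalize loop.

```zig
const std = @import("std");

// Deltas copied from the schedule above, relative to latest_finalized.
const justifiable_deltas = [_]u64{ 1, 2, 3, 4, 5, 6, 9, 12, 16, 20, 25, 30, 36, 42, 49, 56, 64, 72, 81, 90 };

/// True if `slot` sits at one of the listed deltas from `finalized_slot`.
fn isListedJustifiable(finalized_slot: u64, slot: u64) bool {
    if (slot < finalized_slot) return false;
    for (justifiable_deltas) |d| {
        if (slot - finalized_slot == d) return true;
    }
    return false;
}

/// Mirrors the finalization condition: no justifiable slot may lie
/// strictly between source and target.
fn canTargetFinalize(finalized_slot: u64, source_slot: u64, target_slot: u64) bool {
    var slot = source_slot + 1;
    while (slot < target_slot) : (slot += 1) {
        if (isListedJustifiable(finalized_slot, slot)) return false;
    }
    return true;
}

test "only the next justifiable slot after source can finalize it" {
    const finalized: u64 = 1000;
    // Target at +90: deltas 12, 16, 20, ... lie in between.
    try std.testing.expect(!canTargetFinalize(finalized, finalized + 9, finalized + 90));
    // Target at +12: deltas 10 and 11 are not justifiable.
    try std.testing.expect(canTargetFinalize(finalized, finalized + 9, finalized + 12));
}
```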

Cause 2 — Starvation self-reinforces over time

Once finalization stalls, latest_finalized is frozen. With each new justification cycle, latest_justified advances but finalization remains impossible. The gap between latest_finalized and any selected target grows. More intermediate justifiable slots accumulate between source and target. can_target_finalize becomes progressively harder to satisfy. This is why the symptom is worse on the long-running Ansible devnet than on short local runs.

Cause 3 — Hardcoded 3-step walk in target selection is fragile

The retreat from head toward safeTarget is capped at a fixed three iterations, regardless of how far the head is from safeTarget:

for (0..3) |_| {
    if (nodes[target_idx].slot > self.safeTarget.slot) {
        target_idx = nodes[target_idx].parent orelse ...;
    }
}

If the head is more than 3 slots ahead of safeTarget (common during any slot burst or recovery), the walk stops early. The subsequent while loop then begins from a position still above safeTarget. Different nodes with slightly different head states start the while loop at different depths, leading to inconsistent target selection across the network — even when all nodes agree on the canonical chain.

Cause 4 — Non-atomic fcStore snapshot during attestation construction

In constructAttestationData (chain.zig), getAttestationTarget() and getLatestJustified() are called as two separate lock acquisitions. If a block arrives and triggers a justification event between these two calls, the target is computed under one view of latest_finalized while the source comes from a newer view. This can produce attestation data where source.slot > target.slot, which the state transition function (STF) silently rejects.

Cause 5 — Target slot may correspond to a missed block (zero hash)

getAttestationTargetUnlocked walks the ProtoArray tree, which only contains blocks that actually arrived. If the selected target slot had no block (missed slot), historical_block_hashes at that index is ZERO_HASH. The STF skips any attestation where source or target root is ZERO_HASH. Validators waste their duty for that slot without any error or warning.

What Needs to Change

The most direct fix for Causes 1 and 2 is to change the target selection strategy. Instead of walking backward from the current head to find the most recent justifiable slot, getAttestationTargetUnlocked should find the immediately next justifiable slot after latest_justified. This is the only target that can ever satisfy can_target_finalize.
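
As a rough illustration of "the immediately next justifiable slot after latest_justified", the sketch below scans forward one slot at a time. It covers slot arithmetic only: a real fix would still have to map the slot to a block on the head's chain. `isJustifiableDelta` is a hypothetical predicate inferred from the delta schedule listed in this issue (deltas up to 5, perfect squares, and pronic numbers n(n+1)), not the repo's IsJustifiableSlot, and `nextJustifiableSlot` is likewise a made-up name.

```zig
const std = @import("std");

/// Hypothetical predicate inferred from the listed schedule.
fn isJustifiableDelta(delta: u64) bool {
    if (delta <= 5) return true;
    var r: u64 = 0;
    while ((r + 1) * (r + 1) <= delta) : (r += 1) {}
    return r * r == delta or r * (r + 1) == delta;
}

/// First justifiable slot strictly after `justified_slot`,
/// measured against `finalized_slot`.
fn nextJustifiableSlot(finalized_slot: u64, justified_slot: u64) u64 {
    std.debug.assert(justified_slot >= finalized_slot);
    var slot = justified_slot + 1;
    // Terminates because perfect squares occur at arbitrarily large deltas.
    while (!isJustifiableDelta(slot - finalized_slot)) : (slot += 1) {}
    return slot;
}

test "next justifiable slot after latest_justified" {
    // finalized = 100, justified = 109 (delta 9): next is delta 12.
    try std.testing.expect(nextJustifiableSlot(100, 109) == 112);
    // finalized = 100, justified = 106 (delta 6): next is delta 9.
    try std.testing.expect(nextJustifiableSlot(100, 106) == 109);
}
```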

Cause 3 requires replacing the hardcoded for (0..3) with a loop that keeps walking toward the parent until node.slot <= safeTarget.slot, with no fixed iteration cap.
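
A minimal sketch of the difference, assuming a simplified `Node` with only a slot and a parent index (the real ProtoArray node has more fields). The original snippet elides what happens when parent is null; stopping the walk with `break` is an assumption made here. `cappedWalk` and `walkToSafeTarget` are hypothetical names.

```zig
const std = @import("std");

/// Simplified stand-in for a ProtoArray node.
const Node = struct { slot: u64, parent: ?usize };

/// Current behaviour: at most three steps toward the parent.
fn cappedWalk(nodes: []const Node, start_idx: usize, safe_target_slot: u64) usize {
    var idx = start_idx;
    for (0..3) |_| {
        if (nodes[idx].slot > safe_target_slot) {
            idx = nodes[idx].parent orelse break;
        }
    }
    return idx;
}

/// Proposed behaviour: keep walking until the slot is at or below safeTarget.
fn walkToSafeTarget(nodes: []const Node, start_idx: usize, safe_target_slot: u64) usize {
    var idx = start_idx;
    while (nodes[idx].slot > safe_target_slot) {
        idx = nodes[idx].parent orelse break;
    }
    return idx;
}

test "capped walk stops above safeTarget when head is far ahead" {
    // Linear chain; slot 4 was missed, so node 4 holds slot 5.
    const nodes = [_]Node{
        .{ .slot = 0, .parent = null },
        .{ .slot = 1, .parent = 0 },
        .{ .slot = 2, .parent = 1 },
        .{ .slot = 3, .parent = 2 },
        .{ .slot = 5, .parent = 3 },
        .{ .slot = 6, .parent = 4 },
        .{ .slot = 7, .parent = 5 },
        .{ .slot = 8, .parent = 6 },
    };
    // Head is node 7 (slot 8), safeTarget is slot 2.
    try std.testing.expect(nodes[cappedWalk(&nodes, 7, 2)].slot == 5);
    try std.testing.expect(nodes[walkToSafeTarget(&nodes, 7, 2)].slot == 2);
}
```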

Cause 4 requires a single atomic snapshot of fcStore.latest_finalized and fcStore.latest_justified to be taken at the start of attestation construction, used consistently for both target and source.
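
One possible shape for such a snapshot, sketched with std.Thread.Mutex. The lock type used by the real fcStore is not shown in this issue, so the mutex, the `Checkpoint` layout, and the `Snapshot` and `snapshot` names are all assumptions for illustration.

```zig
const std = @import("std");

/// Hypothetical checkpoint layout.
const Checkpoint = struct { slot: u64, root: [32]u8 };

/// Both checkpoints as seen under a single lock acquisition.
const Snapshot = struct { finalized: Checkpoint, justified: Checkpoint };

/// Simplified stand-in for fcStore.
const FcStore = struct {
    mutex: std.Thread.Mutex = .{},
    latest_finalized: Checkpoint,
    latest_justified: Checkpoint,

    /// Copies both checkpoints while holding the lock once, so a
    /// justification event cannot land between the two reads.
    fn snapshot(self: *FcStore) Snapshot {
        self.mutex.lock();
        defer self.mutex.unlock();
        return .{
            .finalized = self.latest_finalized,
            .justified = self.latest_justified,
        };
    }
};

test "snapshot returns a consistent pair" {
    var store = FcStore{
        .latest_finalized = .{ .slot = 64, .root = [_]u8{0xaa} ** 32 },
        .latest_justified = .{ .slot = 72, .root = [_]u8{0xbb} ** 32 },
    };
    const snap = store.snapshot();
    try std.testing.expect(snap.finalized.slot == 64);
    try std.testing.expect(snap.justified.slot == 72);
}
```

Target selection and source selection would then both take the same `Snapshot` value as input instead of re-reading fcStore.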

Cause 5 requires verifying that the selected target slot has a non-zero block root before returning it, and continuing the walk toward the parent if it does not.
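
A sketch of that check, assuming historical_block_hashes can be indexed directly by slot and using a simplified `Node` type. `hasBlockRoot` and `skipMissedSlots` are hypothetical helpers. The justifiability check is left out for brevity and would need to be re-applied after each step in a real fix.

```zig
const std = @import("std");

const ZERO_HASH = [_]u8{0} ** 32;

/// Simplified stand-in for a ProtoArray node.
const Node = struct { slot: u64, parent: ?usize };

/// True if a non-zero root is recorded for `slot`.
fn hasBlockRoot(hashes: []const [32]u8, slot: u64) bool {
    const i: usize = @intCast(slot);
    return i < hashes.len and !std.mem.eql(u8, &hashes[i], &ZERO_HASH);
}

/// Walks toward the parent until a node with a recorded root is found.
/// Returns null if the walk runs out of ancestors.
fn skipMissedSlots(nodes: []const Node, hashes: []const [32]u8, start_idx: usize) ?usize {
    var cur: ?usize = start_idx;
    while (cur) |idx| : (cur = nodes[idx].parent) {
        if (hasBlockRoot(hashes, nodes[idx].slot)) return idx;
    }
    return null;
}

test "candidate with a zero root is skipped" {
    const nodes = [_]Node{
        .{ .slot = 0, .parent = null },
        .{ .slot = 1, .parent = 0 },
        .{ .slot = 2, .parent = 1 },
    };
    const hashes = [_][32]u8{ [_]u8{1} ** 32, [_]u8{2} ** 32, ZERO_HASH };
    try std.testing.expect(skipMissedSlots(&nodes, &hashes, 2).? == 1);
}
```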

Affected Files

  • pkgs/node/src/forkchoice.zig: getAttestationTargetUnlocked
  • pkgs/types/src/state.zig: process_attestations / can_target_finalize loop
  • pkgs/node/src/chain.zig: constructAttestationData

Labels: bug