node: finalization stalls due to misalignment between target selection and can_target_finalize condition #702

## Problem Statement

On the Ansible-managed long-running devnet, the chain head advances normally — blocks are proposed and processed every slot — and `latest_justified` progresses (justification events fire), but `latest_finalized` never advances. Finalization is permanently stalled regardless of how long the devnet runs.

The symptom is asymmetric: attestation collection, block production, and justification all appear healthy; only finalization is stuck.

## Root Cause Analysis

### Cause 1 — Target selection is structurally misaligned with the finalization condition (primary)

The finalization condition in `process_attestations` (`state.zig` lines 492–502) requires that `target` is the **immediately next** justifiable slot after `source`, with **no other justifiable slots between them**:

```zig
// source is finalized if target is the next valid justifiable hash
var can_target_finalize = true;
const start_slot_usize: usize = @intCast(source_slot + 1);
const end_slot_usize: usize = @intCast(target_slot);
for (start_slot_usize..end_slot_usize) |slot_usize| {
    const slot: Slot = @intCast(slot_usize);
    if (try utils.IsJustifiableSlot(self.latest_finalized.slot, slot)) {
        can_target_finalize = false;
        break;
    }
}
```

Meanwhile, `getAttestationTargetUnlocked` (`forkchoice.zig`) selects the target by walking **backward from the current head** to find the most recent justifiable slot:

```zig
// walk at most 3 steps back from head toward safeTarget
for (0..3) |_| {
    if (nodes[target_idx].slot > self.safeTarget.slot) {
        target_idx = nodes[target_idx].parent orelse ...;
    }
}
// then find the first justifiable slot working backward
while (!try types.IsJustifiableSlot(self.fcStore.latest_finalized.slot, nodes[target_idx].slot)) {
    target_idx = nodes[target_idx].parent orelse ...;
}
```

These two goals are in conflict. Target selection finds the most *recent* justifiable slot anchored to the current head. The finalization condition requires the *immediately next* justifiable slot after source. On a long-running devnet where `latest_finalized` has fallen behind by many slots, the target selected by `getAttestationTargetUnlocked` is many justifiable steps ahead of `source`. The justifiable slot schedule from `IsJustifiableSlot` is:

```
Justifiable deltas from latest_finalized: 1, 2, 3, 4, 5, 6, 9, 12, 16, 20, 25, 30, 36, 42, 49, 56, 64, 72, 81, 90...
```

If `source = latest_finalized + 9` and `target = latest_finalized + 90`, the justifiable slots at deltas 12, 16, 20, 25, ..., 81 all lie strictly between them, so `can_target_finalize` evaluates to `false`. `source` is not finalized even though `target` has just been justified.

For `can_target_finalize` to be `true`, target must be exactly one justifiable step ahead of source — but the target selection never guarantees this.
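The arithmetic above can be checked with a small standalone sketch. `isJustifiableSlot` here is a hypothetical stand-in for `IsJustifiableSlot`: its rule (delta at most 5, a perfect square, or of the form x*(x+1)) is inferred from the schedule listed above, not copied from the codebase. `canTargetFinalize` mirrors the loop in `process_attestations`.

```zig
const std = @import("std");

// Hypothetical stand-in for IsJustifiableSlot. Assumed rule, inferred from the
// listed schedule: delta <= 5, a perfect square, or x * (x + 1).
fn isJustifiableSlot(finalized_slot: u64, slot: u64) bool {
    if (slot < finalized_slot) return false;
    const delta = slot - finalized_slot;
    if (delta <= 5) return true;
    var x: u64 = 1;
    while (x * x <= delta) : (x += 1) {
        if (x * x == delta or x * (x + 1) == delta) return true;
    }
    return false;
}

// Mirrors the can_target_finalize loop: true only when no justifiable slot
// lies strictly between source and target.
fn canTargetFinalize(finalized_slot: u64, source_slot: u64, target_slot: u64) bool {
    var slot = source_slot + 1;
    while (slot < target_slot) : (slot += 1) {
        if (isJustifiableSlot(finalized_slot, slot)) return false;
    }
    return true;
}

test "worked example from the text" {
    const finalized: u64 = 100;
    // source = +9, target = +90: deltas 12, 16, 20, ... lie in between.
    try std.testing.expect(!canTargetFinalize(finalized, finalized + 9, finalized + 90));
    // source = +9, target = +12: nothing justifiable in between.
    try std.testing.expect(canTargetFinalize(finalized, finalized + 9, finalized + 12));
}
```

Under these assumptions, only a target exactly one justifiable step past the source passes the check, which is the property the current target selection does not guarantee.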

### Cause 2 — Starvation self-reinforces over time

Once finalization stalls, `latest_finalized` is frozen. Each new justification cycle advances `latest_justified`, but finalization remains impossible, so the gap between `latest_finalized` and any selected target keeps growing. As more intermediate justifiable slots accumulate between source and target, `can_target_finalize` becomes progressively harder to satisfy. This is why the symptom is worse on the long-running Ansible devnet than on short local runs.

### Cause 3 — Hardcoded 3-step walk in target selection is fragile

The retreat from head toward `safeTarget` is capped at three steps, no matter how far ahead of `safeTarget` the head is:

```zig
for (0..3) |_| {
    if (nodes[target_idx].slot > self.safeTarget.slot) {
        target_idx = nodes[target_idx].parent orelse ...;
    }
}
```

If the head is more than 3 slots ahead of `safeTarget` (common during any slot burst or recovery), the walk stops early. The subsequent `while` loop then begins from a position still above `safeTarget`. Different nodes with slightly different head states start the `while` loop at different depths, leading to inconsistent target selection across the network — even when all nodes agree on the canonical chain.

### Cause 4 — Non-atomic fcStore snapshot during attestation construction

In `constructAttestationData` (`chain.zig`), `getAttestationTarget()` and `getLatestJustified()` are called as **two separate lock acquisitions**. If a block arrives and triggers a justification event between these two calls, the target is computed under one view of `latest_finalized` while the source comes from a newer view. This can produce attestation data where `source.slot > target.slot`, which the state transition function (STF) silently rejects.

### Cause 5 — Target slot may correspond to a missed block (zero hash)

`getAttestationTargetUnlocked` walks the ProtoArray tree, which only contains blocks that actually arrived. If the selected target slot had no block (missed slot), `historical_block_hashes` at that index is `ZERO_HASH`. The STF skips any attestation where source or target root is `ZERO_HASH`. Validators waste their duty for that slot without any error or warning.

## What Needs to Change

The most direct fix for Causes 1 and 2 is to change the target selection strategy. Instead of walking backward from the current head to find the most recent justifiable slot, `getAttestationTargetUnlocked` should find the **immediately next justifiable slot after `latest_justified`**. This is the only target that can ever satisfy `can_target_finalize`.
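A minimal sketch of that selection rule, assuming the justifiability rule inferred from the schedule in Cause 1 (delta at most 5, a perfect square, or x*(x+1)). `nextJustifiableSlotAfter` and `isJustifiableSlot` are hypothetical names. A real implementation would also have to map the slot to a block in the fork-choice tree and handle the case where that slot is ahead of the head, which this sketch ignores.

```zig
const std = @import("std");

// Hypothetical stand-in for IsJustifiableSlot (assumed rule, see lead-in).
fn isJustifiableSlot(finalized_slot: u64, slot: u64) bool {
    if (slot < finalized_slot) return false;
    const delta = slot - finalized_slot;
    if (delta <= 5) return true;
    var x: u64 = 1;
    while (x * x <= delta) : (x += 1) {
        if (x * x == delta or x * (x + 1) == delta) return true;
    }
    return false;
}

// Returns the first justifiable slot strictly after justified_slot,
// measured against the schedule anchored at finalized_slot.
fn nextJustifiableSlotAfter(finalized_slot: u64, justified_slot: u64) u64 {
    var slot = justified_slot + 1;
    while (!isJustifiableSlot(finalized_slot, slot)) : (slot += 1) {}
    return slot;
}

test "next justifiable slot after latest_justified" {
    // finalized = 100, justified = 109 (delta 9): the next step is delta 12.
    try std.testing.expect(nextJustifiableSlotAfter(100, 109) == 112);
}
```

By construction there is no justifiable slot between the source and the returned slot, so a target at that slot is the one candidate for which the `can_target_finalize` loop finds nothing to reject.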

Cause 3 requires replacing the hardcoded `for (0..3)` with a loop that keeps following parent links until `node.slot <= safeTarget.slot`, however many steps that takes.
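One way to express that loop, sketched over a simplified, hypothetical `Node` type (the real ProtoArray node carries more fields). The excerpt in Cause 3 elides how a missing parent is handled; this sketch assumes the walk simply stops at the root.

```zig
const std = @import("std");

// Hypothetical, simplified node: only the fields the walk needs.
const Node = struct {
    slot: u64,
    parent: ?usize,
};

// Follows parent links from start_idx until reaching a node whose slot is
// at or below safe_target_slot, or until there is no parent left.
fn walkBackToSafeTarget(nodes: []const Node, start_idx: usize, safe_target_slot: u64) usize {
    var idx = start_idx;
    while (nodes[idx].slot > safe_target_slot) {
        idx = nodes[idx].parent orelse break;
    }
    return idx;
}

test "walk is not capped at three steps" {
    const nodes = [_]Node{
        .{ .slot = 0, .parent = null },
        .{ .slot = 1, .parent = 0 },
        .{ .slot = 2, .parent = 1 },
        .{ .slot = 3, .parent = 2 },
        .{ .slot = 4, .parent = 3 },
        .{ .slot = 5, .parent = 4 },
        .{ .slot = 6, .parent = 5 },
    };
    // Head at slot 6, safe target at slot 1: five steps back, not three.
    try std.testing.expect(walkBackToSafeTarget(&nodes, 6, 1) == 1);
}
```

With this shape, two nodes that agree on the chain and on `safeTarget` start the subsequent justifiable-slot search from the same position, regardless of how far ahead their heads are.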

Cause 4 requires a single atomic snapshot of `fcStore.latest_finalized` and `fcStore.latest_justified` to be taken at the start of attestation construction, used consistently for both target and source.
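A sketch of the snapshot idea, assuming the store is guarded by a `std.Thread.Mutex`. `FcStore`, `Checkpoint`, and `snapshot` are simplified, hypothetical names, not the project's actual types.

```zig
const std = @import("std");

// Hypothetical, simplified checkpoint: only the slot is modelled.
const Checkpoint = struct { slot: u64 };

const FcStore = struct {
    mutex: std.Thread.Mutex = .{},
    latest_justified: Checkpoint,
    latest_finalized: Checkpoint,

    const Snapshot = struct {
        latest_justified: Checkpoint,
        latest_finalized: Checkpoint,
    };

    // Copies both checkpoints under a single lock acquisition, so the caller
    // never sees a justified/finalized pair taken at two different moments.
    fn snapshot(self: *FcStore) Snapshot {
        self.mutex.lock();
        defer self.mutex.unlock();
        return .{
            .latest_justified = self.latest_justified,
            .latest_finalized = self.latest_finalized,
        };
    }
};

test "snapshot is unaffected by later store updates" {
    var store = FcStore{
        .latest_justified = .{ .slot = 12 },
        .latest_finalized = .{ .slot = 9 },
    };
    const snap = store.snapshot();
    store.latest_justified = .{ .slot = 16 }; // simulated justification event
    try std.testing.expect(snap.latest_justified.slot == 12);
    try std.testing.expect(snap.latest_finalized.slot == 9);
}
```

Attestation construction would then derive both the target and the source from the same `Snapshot` value instead of querying the store twice.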

Cause 5 requires verifying that the selected target slot has a non-zero block root before returning it, and continuing the walk toward older ancestors if it does not.
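The check could look roughly like the following sketch. `Node`, `Root`, and `firstTargetWithBlock` are hypothetical, and `hashes` stands in for `historical_block_hashes` indexed by slot. The justifiability test that the real walk also applies is left out to keep the example focused on the zero-hash case.

```zig
const std = @import("std");

const Root = [32]u8;
const ZERO_HASH: Root = [_]u8{0} ** 32;

// Hypothetical, simplified node: slot is usize so it can index hashes directly.
const Node = struct {
    slot: usize,
    parent: ?usize,
};

// Starting at start_idx and following parent links, returns the first node
// whose slot has a non-zero entry in hashes, or null if no ancestor has one.
fn firstTargetWithBlock(nodes: []const Node, hashes: []const Root, start_idx: usize) ?usize {
    var idx: ?usize = start_idx;
    while (idx) |i| : (idx = nodes[i].parent) {
        const slot = nodes[i].slot;
        if (slot < hashes.len and !std.mem.eql(u8, &hashes[slot], &ZERO_HASH)) return i;
    }
    return null;
}

test "skips slots whose recorded hash is zero" {
    const some_root: Root = [_]u8{1} ** 32;
    const hashes = [_]Root{ some_root, some_root, ZERO_HASH, ZERO_HASH };
    const nodes = [_]Node{
        .{ .slot = 0, .parent = null },
        .{ .slot = 1, .parent = 0 },
        .{ .slot = 2, .parent = 1 },
        .{ .slot = 3, .parent = 2 },
    };
    // Slots 3 and 2 have zero hashes, so the walk settles on the node at slot 1.
    try std.testing.expect(firstTargetWithBlock(&nodes, &hashes, 3).? == 1);
}
```

Returning `null` when no ancestor qualifies lets the caller log or surface the condition rather than emitting an attestation that will be skipped.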

## Affected Files

- `pkgs/node/src/forkchoice.zig` — `getAttestationTargetUnlocked`
- `pkgs/types/src/state.zig` — `process_attestations` / `can_target_finalize` loop
- `pkgs/node/src/chain.zig` — `constructAttestationData`
