
Defer state/block pruning until after block cascade completes#240

Merged
pablodeymo merged 2 commits into main from prune-at-after-sync
Mar 17, 2026

Conversation

@pablodeymo
Collaborator

Motivation

During the devnet4 run (2026-03-13), all three ethlambda nodes entered an infinite re-processing loop at slot ~15276, generating ~3.5GB of logs each and consuming 100% CPU for hours.

This PR fixes the root cause by deferring heavy state/block pruning until after a block processing cascade completes, so parent states survive long enough for their children to be processed.

Root Cause

The infinite loop is caused by fallback pruning running inside the block processing cascade, deleting states that pending children still need.

The three interacting mechanisms

1. Asymmetric retention creates a state-header gap

When finalization stalls, fallback pruning keeps only STATES_TO_KEEP=900 states but BLOCKS_TO_KEEP=21600 headers. Block headers exist in DB without their states.
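Assuming one block per slot (an illustrative simplification; the real retention limits count DB entries, not slots), the size of that gap can be sketched directly from the two constants:

```rust
// Retention constants from the PR description. With one block per slot,
// up to (BLOCKS_TO_KEEP - STATES_TO_KEEP) retained headers have no
// matching state in the DB once fallback pruning has run.
const STATES_TO_KEEP: u64 = 900;
const BLOCKS_TO_KEEP: u64 = 21_600;

fn main() {
    let gap = BLOCKS_TO_KEEP - STATES_TO_KEEP;
    assert_eq!(gap, 20_700);
    println!("up to {gap} headers can exist without their states");
}
```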

2. Chain walk reaches protected checkpoints

When a block arrives with a missing parent, process_or_pend_block walks ancestor headers looking for one whose parent has state. Protected checkpoints (justified/finalized) always have state, so the walk can reach blocks thousands of slots behind head.

3. Mid-cascade pruning deletes just-computed states

on_block_core calls update_checkpoints after every block, which runs prune_old_states. States for old slots (far behind head) are immediately deleted — even if they were just computed milliseconds ago by the same cascade.
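A toy model (not the real store API, just a set of slots plus the retention rule) shows why the cascade's own states disappear under this scheme. The slot numbers are taken from the incident; the head value is illustrative:

```rust
use std::collections::BTreeSet;

const STATES_TO_KEEP: u64 = 900;

// Retention rule as in fallback pruning, ignoring protected checkpoint
// roots for simplicity: drop every state more than STATES_TO_KEEP slots
// behind head.
fn prune_old_states(states: &mut BTreeSet<u64>, head: u64) {
    states.retain(|&slot| slot + STATES_TO_KEEP > head);
}

fn main() {
    // Illustrative head, well past the stalled range being re-processed.
    let head = 17_000u64;
    let mut states = BTreeSet::new();

    // Buggy flow: the cascade re-processes 15266..=15276, and pruning
    // runs after every single block (inside update_checkpoints).
    for slot in 15_266..=15_276u64 {
        states.insert(slot);                 // state just computed...
        prune_old_states(&mut states, head); // ...and deleted right away
    }

    // Child 15278 now asks for its parent's state and misses, so it is
    // stored as pending and the chain walk re-discovers 15266: the loop.
    assert!(!states.contains(&15_276));
}
```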

The loop

                    ┌──────────────────────────────────────────────┐
                    │                                              │
                    ▼                                              │
1. Chain walk finds block 15266 (parent=4dda, justified)          │
   → parent state exists (protected) → enqueue for processing     │
                    │                                              │
2. Cascade processes 15266 → 15269 → ... → 15276                 │
   → states computed and stored                                   │
                    │                                              │
3. Each on_block_core calls update_checkpoints                    │
   → fallback pruning runs → states for slots 15266-15276        │
     are IMMEDIATELY deleted (slot < head - 900)                  │
                    │                                              │
4. collect_pending_children(15276) finds block 15278              │
   → process_or_pend_block(15278)                                 │
   → has_state(parent=15276) → FALSE (just pruned!)               │
   → stores as pending                                            │
                    │                                              │
5. Chain walk for 15278 re-discovers 15266                        │
   → parent 4dda still has state (protected)                      │
   → enqueue 15266 ─────────────────────────────────────────────→─┘

How it was triggered in devnet4

  1. 9 validators, 7 clients. Finalization stalled at slot 15261 due to a fork at slot 15264 (qlean diverged).
  2. At ~10:13:40 UTC, qlean's alternate fork blocks arrived at ethlambda via gossip.
  3. The chain walk for these blocks traversed ~2000 slots back to the justified checkpoint.
  4. The cascade re-processed blocks 15266→15276, but fallback pruning deleted each state immediately.
  5. All three ethlambda nodes (validators 6, 7, 8) entered the loop simultaneously.

Solution

Defer heavy pruning (states + blocks) until after the block cascade completes.

Before (pruning runs per-block, mid-cascade)

on_block
  └─ while queue:
       └─ process_or_pend_block
            └─ on_block_core
                 └─ update_checkpoints
                      ├─ write metadata          ← immediate
                      ├─ prune_live_chain        ← immediate
                      ├─ prune_gossip_signatures ← immediate
                      ├─ prune_old_states        ← DELETES PARENT STATES MID-CASCADE
                      └─ prune_old_blocks        ← DELETES BLOCK DATA MID-CASCADE

After (pruning deferred to end of cascade)

on_block
  └─ while queue:
  │    └─ process_or_pend_block
  │         └─ on_block_core
  │              └─ update_checkpoints
  │                   ├─ write metadata          ← immediate
  │                   ├─ prune_live_chain        ← immediate (fork choice correctness)
  │                   ├─ prune_gossip_signatures ← immediate (cheap)
  │                   └─ (no state/block pruning)
  │
  └─ store.prune_old_data()                      ← runs ONCE after cascade
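A minimal sketch of the restructured flow. The method names follow the PR; the Store type and everything else here is a stand-in, not the real crates/storage implementation:

```rust
use std::collections::VecDeque;

// Stand-in for the real store: just counts how often heavy pruning runs.
struct Store {
    pruned_after_cascade: u32,
}

impl Store {
    fn update_checkpoints(&mut self) {
        // Per-block work: write metadata, prune_live_chain,
        // prune_gossip_signatures -- no state/block pruning here.
    }
    fn prune_old_data(&mut self) {
        // Heavy pruning: prune_old_states + prune_old_blocks.
        self.pruned_after_cascade += 1;
    }
}

fn on_block(store: &mut Store, mut queue: VecDeque<u64>) {
    while let Some(_block) = queue.pop_front() {
        // process_or_pend_block -> on_block_core
        store.update_checkpoints();
        // collect_pending_children would push newly unblocked
        // children onto `queue` here.
    }
    // Heavy pruning runs exactly once, after the cascade completes.
    store.prune_old_data();
}

fn main() {
    let mut store = Store { pruned_after_cascade: 0 };
    on_block(&mut store, VecDeque::from(vec![1, 2, 3]));
    assert_eq!(store.pruned_after_cascade, 1);
}
```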

Split of update_checkpoints

Operation | Where it runs | Why
Write head/justified/finalized metadata | update_checkpoints (per-block) | Checkpoints must be current for fork choice
prune_live_chain | update_checkpoints (per-block) | Affects fork choice traversal
prune_gossip_signatures | update_checkpoints (per-block) | Cheap, correctness-related
prune_attestation_data_by_root | update_checkpoints (per-block) | Cheap, correctness-related
prune_old_states | prune_old_data (after cascade) | Heavy; causes the infinite loop if run mid-cascade
prune_old_blocks | prune_old_data (after cascade) | Heavy; coupled with state pruning

Why this fixes the loop

With deferred pruning, the devnet4 scenario plays out safely:

  1. Cascade processes 15266 → 15269 → ... → 15276 → states are KEPT (no pruning mid-cascade)
  2. collect_pending_children(15276) finds 15278 → has_state(parent=15276) → TRUE (state still exists)
  3. 15278 processes successfully, cascade continues through children
  4. Queue empties, while loop ends
  5. prune_old_data() runs once — deletes old states
  6. Cascade is already done — no one re-triggers it
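Replaying the same toy model (protected checkpoint roots again ignored, head illustrative) with pruning deferred until the queue empties shows the parent state surviving for its child:

```rust
use std::collections::BTreeSet;

const STATES_TO_KEEP: u64 = 900;

// Same simplified retention rule as fallback pruning.
fn prune_old_states(states: &mut BTreeSet<u64>, head: u64) {
    states.retain(|&slot| slot + STATES_TO_KEEP > head);
}

fn main() {
    let head = 17_000u64; // illustrative
    let mut states = BTreeSet::new();

    // Deferred flow: the cascade processes 15266..=15276 with NO
    // pruning between blocks, so every parent state survives.
    for slot in 15_266..=15_276u64 {
        states.insert(slot);
    }

    // Child 15278 finds its parent's state and processes successfully.
    assert!(states.contains(&15_276));
    states.insert(15_278);

    // Queue empties; prune_old_data() now runs exactly once. The old
    // states are deleted, but the cascade is already done, so nothing
    // re-triggers it.
    prune_old_states(&mut states, head);
    assert!(!states.contains(&15_266));
}
```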

Cross-client validation

We surveyed how other lean consensus clients handle this (Lighthouse, Zeam, Ream, Qlean, Lantern, Grandine). None of them prune states mid-cascade. Common patterns:

  • Zeam: Canonicality-based pruning, only after finalization or after long stalls (14,400 slots). Never during block processing.
  • Ream: Prunes one state per tick (not during block import).
  • Grandine: Never prunes states (in-memory forever).
  • Lighthouse: Background migrator thread, decoupled from block import.

Changes

  • crates/storage/src/store.rs: Split update_checkpoints — extract prune_old_states/prune_old_blocks into new prune_old_data() method. Lightweight pruning (live chain, signatures, attestation data) stays in update_checkpoints.
  • crates/blockchain/src/lib.rs: Call store.prune_old_data() once after the on_block while loop completes.
  • Tests: Updated fallback_pruning_* tests to call prune_old_data() explicitly.

How to Test

  1. make test — all 125 tests pass including 27 fork choice spec tests
  2. Deploy to devnet with a multi-client setup where finalization stalls and alternate fork blocks arrive
  3. Verify ethlambda nodes do not enter re-processing loops (no repeated "Block imported successfully" for the same slot in logs)
  4. Monitor memory during long finalization stalls — temporary state accumulation during cascades is bounded by cascade size

@github-actions

🤖 Kimi Code Review

Review Summary

The PR correctly separates heavy pruning from checkpoint updates to prevent state deletion during block cascades. This is a critical fix for consensus correctness.

Issues Found

  1. Race condition in prune_old_data (storage/src/store.rs:503-510)

    • The method takes &mut self but uses self.latest_finalized() and self.latest_justified() which internally lock the store
    • This creates a potential deadlock if called while other store methods hold locks
    • Fix: Store the roots in local variables before pruning:
    pub fn prune_old_data(&mut self) {
        let finalized_root = self.latest_finalized().root;
        let justified_root = self.latest_justified().root;
        let protected_roots = [finalized_root, justified_root];
        // ... rest of method
    }
  2. Test regression (storage/src/store.rs:1480-1490)

    • The test comment says "905 headers total" but the assertion checks for 902 states
    • After the change, the test needs to call prune_old_data() to maintain the same behavior
    • Fix: Update the comment to reflect the actual expected count
  3. Missing documentation (blockchain/src/lib.rs:304-307)

    • The comment mentions "fallback pruning" but this term isn't defined in the codebase
    • Fix: Clarify that this refers to the deferred pruning mechanism

Security & Correctness

  • ✅ The separation prevents consensus failures from premature state deletion
  • ✅ Protected roots correctly exclude finalized/justified states from pruning
  • ✅ The change maintains the same pruning logic, just defers execution

Performance

  • ✅ Defers expensive I/O operations until after block processing completes
  • ✅ Reduces redundant work during block cascades
  • ⚠️ The race condition could cause lock contention under high load

The PR is approved with fixes - the race condition should be addressed before merge.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

🤖 Codex Code Review

  1. High – pruning can be skipped for locally produced blocks, allowing unbounded DB growth (availability risk).
    Heavy pruning is now only invoked in network block handling (on_block), not in the local proposal path that directly calls process_block.

    If this node does not receive its own published blocks back from gossip, prune_old_data() may never run on proposer-driven progress, so States/Block* can grow without bound under stalled finalization.

    Suggested fix: call self.store.prune_old_data() after successful local process_block in the proposal flow, or centralize pruning in the block-processing core path that runs for both local and network imports.

Overall, the PR’s core intent (avoiding mid-cascade pruning loops) is sound, and I didn’t spot direct fork-choice/justification/finalization logic regressions in this diff beyond the pruning coverage gap above.


Automated review by OpenAI Codex · custom prompt

@github-actions

🤖 Claude Code Review

Now I have enough context for a full review.


PR Review: Defer state/block pruning until after block cascade completes

Summary: This is a targeted, well-diagnosed fix for a real production bug. The root cause analysis in the PR description is excellent, and the change is minimal and focused. The fix correctly identifies that mid-cascade pruning can invalidate state that pending children still need.


Correctness

The fix is correct. The core issue — prune_old_states deleting states for recently-processed blocks before their children can use them — is cleanly addressed by moving heavy pruning outside the cascade loop.

The split in update_checkpoints is well-considered:

  • prune_live_chain must remain immediate (fork choice traversal depends on it).
  • prune_gossip_signatures / prune_attestation_data_by_root are cheap and correctness-relevant, so keeping them immediate is fine.
  • prune_old_states / prune_old_blocks are the dangerous ones during cascades and are correctly deferred.

Minor Observations

Tick path no longer triggers state/block pruning.

on_tick calls update_checkpoints(ForkCheckpoints::head_only(...)), which means checkpoints.finalized is None, so the lightweight pruning block inside update_checkpoints doesn't trigger either. And since prune_old_data() is only called from on_block, ticks never drive state/block pruning. This is fine in practice — states only accumulate when blocks arrive — but worth being aware of: if there's ever a tick-only path that produces states, it would bypass pruning entirely.

prune_old_data() runs unconditionally after every cascade (lib.rs:306), even during normal finalization when storage is within retention limits. This is harmless because both prune_old_states and prune_old_blocks have entries.len() <= STATES_TO_KEEP / BLOCKS_TO_KEEP early-return guards, but it does mean two full BlockHeaders table scans per block under normal conditions. This is a pre-existing inefficiency (the two functions scan the same table independently), not introduced here, but worth noting for future optimization.

Observability regression (minor): The old code distinguished "Pruned finalized data" (with pruned_states, pruned_blocks fields) from "Fallback pruning (finalization stalled)". The new prune_old_data() emits "Pruned old states and blocks" for both cases with no indication of why pruning ran. During a devnet incident, this distinction is useful for diagnosing whether finalization is healthy. Consider adding a field like finalization_advancing: bool or a separate log line.

Memory growth during large cascades: The PR description acknowledges this — states for the full cascade depth are held until the cascade completes. For the devnet4 scenario (~10 slots), this is negligible. In theory a very deep cascade (thousands of slots, as in the chain walk scenario) could hold thousands of states temporarily. This is bounded and correct, just worth monitoring in practice.

Tests

The test updates are correct. The new assertion at store.rs:1483 that verifies update_checkpoints alone does not prune (before prune_old_data() is called) is a good addition that directly documents and guards the new contract.


Summary

The change is sound. The fix directly addresses the infinite-loop root cause without unnecessary refactoring. The PR description is thorough and the cross-client comparison adds confidence. The main things worth tracking operationally are temporary memory growth during large cascades and the minor observability loss in the pruning log message.


Automated review by Claude (Anthropic) · custom prompt

@greptile-apps
Contributor

greptile-apps bot commented Mar 17, 2026

Greptile Summary

This PR fixes a critical infinite re-processing loop observed on devnet4 by deferring heavy state/block pruning (prune_old_states, prune_old_blocks) until after the full block cascade completes, rather than running it per-block inside update_checkpoints. The root cause was that mid-cascade pruning deleted states that pending children still needed, causing them to be re-discovered and re-queued indefinitely.

Key changes:

  • crates/storage/src/store.rs: Extracts prune_old_states/prune_old_blocks from update_checkpoints into a new prune_old_data() public method. update_checkpoints now only performs cheap, fork-choice-critical pruning (live chain index, gossip signatures, attestation data) when finalization advances.
  • crates/blockchain/src/lib.rs: Calls store.prune_old_data() exactly once after the on_block cascade loop exits, ensuring all intermediate states survive until every pending child has been processed.
  • Tests: fallback_pruning_* tests updated to call prune_old_data() explicitly, correctly reflecting the new API contract.

One gap remains: the propose_block path calls process_block directly, bypassing BlockChainServer::on_block, so prune_old_data() is never triggered for locally-proposed blocks. In a multi-client devnet this is typically harmless (other validators' blocks arrive via gossip and trigger pruning), but it is a latent storage-growth risk in degraded or single-validator scenarios.

Confidence Score: 4/5

  • The core fix is correct and well-motivated; safe to merge with the propose_block pruning gap noted.
  • The refactor is logically sound: deferring prune_old_states/prune_old_blocks to after the cascade directly eliminates the infinite re-processing loop. The lightweight pruning left in update_checkpoints (live chain, signatures, attestation data) is correctly identified as fork-choice-critical. Tests are updated and the PR description gives thorough cross-client validation. The one deduction is for the propose_block path that bypasses prune_old_data(), which is a real but low-severity storage-growth gap in edge cases.
  • crates/blockchain/src/lib.rs — specifically the propose_block function which does not call prune_old_data() after locally processing a proposed block.

Important Files Changed

Filename | Overview
crates/blockchain/src/lib.rs | Adds the store.prune_old_data() call after the on_block cascade loop — correctly deferring heavy pruning. However, the propose_block → process_block path does not pass through on_block and therefore never triggers prune_old_data(), leaving a pruning gap for locally-produced blocks.
crates/storage/src/store.rs | Correctly extracts prune_old_data() from update_checkpoints, leaving only lightweight pruning (live chain, gossip signatures, attestation data) inline. The new method is well-documented, uses the same protected roots as the old fallback path, and tests are updated to call it explicitly.

Sequence Diagram

sequenceDiagram
    participant Net as Network/Gossip
    participant OB as on_block (BlockChainServer)
    participant PoPB as process_or_pend_block
    participant PB as process_block
    participant SOB as store::on_block (on_block_core)
    participant UC as update_checkpoints
    participant POD as prune_old_data

    Net->>OB: NewBlock message
    OB->>OB: push block onto queue
    loop cascade — while queue not empty
        OB->>PoPB: pop block
        PoPB->>PB: parent state exists → process_block
        PB->>SOB: store::on_block
        SOB->>UC: update_checkpoints
        UC-->>UC: write head/justified/finalized metadata
        UC-->>UC: prune_live_chain (if finalization advances)
        UC-->>UC: prune_gossip_signatures (if finalization advances)
        UC-->>UC: prune_attestation_data_by_root (if finalization advances)
        Note over UC: NO state/block pruning mid-cascade
        SOB-->>PoPB: Ok
        PoPB->>OB: collect_pending_children → push to queue
    end
    OB->>POD: prune_old_data() — runs ONCE after cascade
    POD-->>POD: prune_old_states (protected: finalized + justified)
    POD-->>POD: prune_old_blocks (protected: finalized + justified)

    Note over OB,POD: propose_block path goes directly to process_block,<br/>bypassing on_block — prune_old_data() is never called

Comments Outside Diff (1)

  1. crates/blockchain/src/lib.rs, line 278-306 (link)

    P1 propose_block bypasses prune_old_data()

    propose_block calls self.process_block(...) directly, which goes through store::on_block → on_block_core → update_checkpoints, but does not go through BlockChainServer::on_block. This means prune_old_data() is never called for locally-proposed blocks.

    In normal multi-client devnet conditions this is tolerable because blocks from other validators arrive frequently via gossip and each triggers BlockChainServer::on_block → prune_old_data(). However, in a degraded-network scenario — e.g. a validator-heavy setup where finalization stalls and the node mostly produces its own blocks without receiving many external ones — storage could grow significantly longer than intended between pruning runs.

    Consider calling self.store.prune_old_data() at the end of propose_block (or restructuring the call path so it always passes through the outer on_block):

    fn propose_block(&mut self, slot: u64, validator_id: u64) {
        // ... existing block build / sign / publish logic ...
    
        // Prune here so locally-proposed blocks don't bypass the pruning path.
        self.store.prune_old_data();
    }

Last reviewed commit: a3060ad

@pablodeymo force-pushed the prune-at-after-sync branch from a3060ad to 22b13dc on March 17, 2026 at 20:24
@pablodeymo
Collaborator Author


🤖 Codex Code Review

1. **High – pruning can be skipped for locally produced blocks, allowing unbounded DB growth (availability risk).**
   Heavy pruning is now only invoked in network block handling (`on_block`), not in the local proposal path that directly calls `process_block`.
   
   * Local proposal path processes block but does not prune: crates/blockchain/src/lib.rs:263
   * Deferred pruning runs only in `on_block`: crates/blockchain/src/lib.rs:303
   * `update_checkpoints` no longer does fallback state/block pruning: crates/storage/src/store.rs:473, crates/storage/src/store.rs:502
   
   If this node does not receive its own published blocks back from gossip, `prune_old_data()` may never run on proposer-driven progress, so `States`/`Block*` can grow without bound under stalled finalization.
   **Suggested fix:** call `self.store.prune_old_data()` after successful local `process_block` in the proposal flow, or centralize pruning in the block-processing core path that runs for both local and network imports.

Overall, the PR’s core intent (avoiding mid-cascade pruning loops) is sound, and I didn’t spot direct fork-choice/justification/finalization logic regressions in this diff beyond the pruning coverage gap above.

For locally produced blocks there's no cascade (a single block in the queue), so prune_old_data() runs immediately after processing — the same timing as the old inline pruning in update_checkpoints. The behavior is identical to before this change.

The fix targets the cascade path: when a gossip block triggers pending-block resolution, the while loop can process many blocks before exiting. Previously, each iteration pruned states that later iterations still needed. Now pruning waits for the loop to finish. For single-block processing (local, or gossip with no pending children), "after the cascade" = "after the one block" = no difference.

@MegaRedHand (Collaborator) left a comment
LGTM

@pablodeymo pablodeymo merged commit 455611c into main Mar 17, 2026
2 checks passed
@pablodeymo pablodeymo deleted the prune-at-after-sync branch March 17, 2026 21:00