Add safety-net pruning to prevent OOM when finalization is stalled (#175)

Closed
pablodeymo wants to merge 1 commit into main from safety-net-pruning
Conversation

Collaborator

@pablodeymo pablodeymo commented Mar 2, 2026

Motivation

When the chain runs for an extended period without finalization (e.g., due to insufficient aggregators or network issues), all pruning is effectively disabled — every prune function gates on finalized_slot advancing. The States table has no pruning at all, and each state is 100+ MB (contains historical_block_hashes up to 8 MiB, justifications_roots up to 8 MiB, validators ~245 KB).

After ~12 hours without finalization: 10,800 states × 100+ MB = potential terabytes of data, causing OOM.

Closes #166
Relates to #103

Description

Adds a safety-net pruning mechanism that activates only when finalization is stalled, using a 1024-slot window (~68 minutes of chain history).
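
As a sanity check on the figures above, the window and stall arithmetic can be verified directly. This is a sketch only; the 4-second slot time is an assumption inferred from "1024 slots ≈ 68 minutes", not taken from the code:

```rust
// Assumed slot duration; inferred from "1024 slots ≈ 68 minutes" above.
const SLOT_SECONDS: u64 = 4;

/// Minutes of chain history covered by a window of `slots` slots.
fn window_minutes(slots: u64) -> u64 {
    slots * SLOT_SECONDS / 60
}

/// Number of slots produced in `hours` hours of wall time.
fn slots_in_hours(hours: u64) -> u64 {
    hours * 3600 / SLOT_SECONDS
}
```

These reproduce both numbers used in the Motivation section: a 1024-slot window spans about 68 minutes, and ~12 hours without finalization accumulates 10,800 states.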

Cutoff calculation

cutoff_slot = max(finalized_slot, head_slot.saturating_sub(1024))
  • Finalization healthy: cutoff_slot == finalized_slot → no-op (existing finalization-triggered pruning already handles it)
  • Finalization stalled: cutoff_slot == head_slot - 1024 → prunes old unfinalized data
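
A minimal sketch of this computation (the constant name matches the PR description; the surrounding Store API is omitted):

```rust
const MAX_UNFINALIZED_SLOTS: u64 = 1024;

/// Cutoff below which unfinalized data may be pruned. Saturating subtraction
/// keeps early slots (head_slot < 1024) from underflowing.
fn cutoff_slot(finalized_slot: u64, head_slot: u64) -> u64 {
    finalized_slot.max(head_slot.saturating_sub(MAX_UNFINALIZED_SLOTS))
}
```

When finalization is healthy the max is dominated by finalized_slot, so the safety net is a no-op; when it stalls, the head-anchored window takes over.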

Protected roots (never pruned)

  • head root
  • latest_finalized root
  • latest_justified root
  • safe_target root
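
Combined with the cutoff, the prune decision reduces to a two-part predicate. This sketch mirrors the `slot <= cutoff_slot && !protected_roots.contains(&root)` filter described later in the reviews, with the root type simplified to a plain byte array:

```rust
use std::collections::HashSet;

type Root = [u8; 32];

/// A block's data is prunable only when it is at or below the cutoff
/// AND its root is not one of the four protected roots.
fn is_prunable(slot: u64, root: &Root, cutoff_slot: u64, protected: &HashSet<Root>) -> bool {
    slot <= cutoff_slot && !protected.contains(root)
}
```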

What gets pruned

All prunable tables, using cutoff_slot instead of finalized_slot:

| Table | Status before this PR |
| --- | --- |
| States | No pruning at all — primary OOM vector |
| BlockHeaders / BlockBodies / BlockSignatures | No pruning at all — accumulate indefinitely |
| LiveChain | Already pruned on finalization, extended to cutoff |
| GossipSignatures | Already pruned on finalization, extended to cutoff |
| AttestationDataByRoot | Already pruned on finalization, extended to cutoff |
| LatestNewAggregatedPayloads | Already pruned on finalization, extended to cutoff |
| LatestKnownAggregatedPayloads | Already pruned on finalization, extended to cutoff |

When it runs

Once per slot at interval 0 in BlockChainServer::on_tick, after tick processing (attestation acceptance) but before block proposal.

New methods in Store

  • safety_net_prune() — Public entry point. Computes cutoff, builds protected set, calls individual prune methods, logs summary at info level.
  • prune_states(cutoff_slot, protected_roots) — Iterates BlockHeaders (small, ~100 bytes each) to find slots, deletes matching States entries.
  • prune_old_blocks(cutoff_slot, protected_roots) — Deletes from BlockHeaders, BlockBodies, and BlockSignatures for non-protected old blocks.
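
An in-memory model of the entry-point flow, for illustration only: the types and table shapes are hypothetical stand-ins (roots abbreviated to u64 keys), while the real methods iterate DB tables and delete via batched writes.

```rust
use std::collections::{HashMap, HashSet};

const MAX_UNFINALIZED_SLOTS: u64 = 1024;

/// Simplified stand-in for Store: two of the tables, keyed by block root.
struct MockStore {
    /// block root -> slot (stands in for BlockHeaders)
    headers: HashMap<u64, u64>,
    /// block root -> opaque state bytes (stands in for States)
    states: HashMap<u64, Vec<u8>>,
}

impl MockStore {
    /// Compute the cutoff, early-return when finalization is healthy,
    /// then prune everything old and unprotected; returns the pruned count.
    fn safety_net_prune(
        &mut self,
        finalized_slot: u64,
        head_slot: u64,
        protected: &HashSet<u64>,
    ) -> usize {
        let cutoff = finalized_slot.max(head_slot.saturating_sub(MAX_UNFINALIZED_SLOTS));
        if cutoff <= finalized_slot {
            return 0; // healthy: finalization-triggered pruning already covers this
        }
        let doomed: Vec<u64> = self
            .headers
            .iter()
            .filter(|&(root, &slot)| slot <= cutoff && !protected.contains(root))
            .map(|(&root, _)| root)
            .collect();
        for root in &doomed {
            self.headers.remove(root);
            self.states.remove(root);
        }
        doomed.len()
    }
}
```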

Files changed

| File | Change |
| --- | --- |
| crates/storage/src/store.rs | +MAX_UNFINALIZED_SLOTS constant, +safety_net_prune(), +prune_states(), +prune_old_blocks() |
| crates/blockchain/src/lib.rs | Call self.store.safety_net_prune() at interval 0 in on_tick |

How to test

Safety-net pruning only activates when head_slot - finalized_slot > 1024, which doesn't occur in any test fixture. All existing tests pass unchanged:

make fmt    # clean
make lint   # clean (clippy -D warnings)
make test   # all 62 tests pass (11 unit + 26 forkchoice + 8 signature + 14 STF + 3 storage)

To test the actual pruning behavior in a live environment:

  1. Start a devnet without the --is-aggregator flag (finalization will stall)
  2. Let it run for >1024 slots (~68 minutes)
  3. Observe Safety-net pruning: finalization stalled log messages with pruned counts
  4. Verify the node does not OOM (previously it would accumulate unbounded state data)

  When the chain runs without finalization (e.g., insufficient aggregators),
  all pruning is disabled since every prune function gates on finalized_slot
  advancing. The States table has no pruning at all, and each state is 100+ MB.
  After ~12 hours without finalization this can reach terabytes of data.

  Add a safety-net that computes cutoff_slot = max(finalized_slot,
  head_slot - 1024) and prunes states, blocks, live chain, signatures,
  attestation data, and aggregated payloads older than the cutoff. Protected
  roots (head, finalized, justified, safe_target) are never pruned. When
  finalization is healthy, cutoff equals finalized_slot and this is a no-op.

  Runs once per slot at interval 0 in on_tick, after tick processing but
  before block proposal.

github-actions bot commented Mar 2, 2026

🤖 Kimi Code Review

Review Summary

The safety-net pruning implementation is a solid defensive mechanism against OOM when finalization stalls. However, several issues need attention:

Critical Issues

  1. Race condition in prune_states and prune_old_blocks (lines 1027-1081 and 1108-1162):

    • Both functions iterate over BlockHeaders while potentially deleting from other tables, but they don't ensure atomicity between the read and write phases
    • A block could be added between the read and write, causing the pruning to delete data for a block that should be protected
    • Fix: Use a single transaction for both read and write operations
  2. Missing pruning functions (lines 995-1003):

    • prune_live_chain, prune_gossip_signatures, prune_attestation_data_by_root, and prune_aggregated_payload_table are called but not implemented
    • This will cause compilation failures
  3. Potential panic in H256::from_ssz_bytes (lines 1042, 1085, 1125):

    • Using expect("valid root") assumes all keys are valid 32-byte roots, but the database might contain corrupted data
    • Fix: Use proper error handling instead of expect

Security & Consensus Issues

  1. Protected roots calculation (lines 982-990):

    • The safe_target() might not be the best choice for protection - consider if this aligns with fork choice safety requirements
    • Missing protection for blocks in the current justified chain that aren't yet finalized
  2. Slot boundary handling (line 976):

    • Using <= cutoff_slot is correct, but ensure this doesn't accidentally prune the genesis block (slot 0) if it becomes relevant

Performance & Implementation Issues

  1. Inefficient iteration pattern (lines 1030-1044, 1113-1127):

    • Both prune_states and prune_old_blocks iterate the entire BlockHeaders table
    • Consider adding a slot-based index or using a more targeted approach
  2. Batch operation inefficiency (lines 1071-1079, 1154-1162):

    • Creating separate batches for each table type could be more efficient as a single batch
    • The clone operations on keys_to_delete are unnecessary overhead

Minor Issues

  1. Warning suppression (line 1040):

    • The warning for failed header decoding might hide important corruption issues
    • Consider logging the actual error or the problematic key
  2. Magic number documentation (line 117):

    • MAX_UNFINALIZED_SLOTS = 1024 should be documented in terms of epochs (32 slots/epoch = 32 epochs)

Suggested Fixes

// 1. Use a single transaction for atomicity
let mut batch = self.backend.begin_write().expect("write batch");
let view = batch.read_view().expect("read view");

// 2. Handle potential decode errors gracefully
let root = match H256::from_ssz_bytes(&key_bytes) {
    Ok(r) => r,
    Err(_) => {
        warn!("Invalid root bytes in database");
        continue;
    }
};

// 3. Implement missing pruning functions
fn prune_live_chain(&mut self, cutoff_slot: u64) -> usize {
    // Implementation needed
    0
}

The core concept is sound, but these issues need resolution before merging.


Automated review by Kimi (Moonshot AI) · custom prompt


github-actions bot commented Mar 2, 2026

🤖 Codex Code Review

  1. High: safety-net pruning can invalidate fork-choice inputs during non-finality.
    self.prune_live_chain(cutoff_slot) is called in safety-net mode without protecting justified/head roots or required ancestry (crates/storage/src/store.rs:1003, crates/storage/src/store.rs:438).
    In prolonged non-finality, latest_justified can lag; pruning LiveChain by slot can drop the justified checkpoint/path that LMD GHOST depends on, causing incorrect or unstable head computation.

  2. High: off-by-one inconsistency leaves dangling LiveChain entries at cutoff_slot.
    State/block pruning uses <= cutoff_slot (crates/storage/src/store.rs:1042, crates/storage/src/store.rs:1077), but prune_live_chain removes only < finalized_slot (crates/storage/src/store.rs:449).
    With safety-net call-site passing cutoff_slot, entries at exactly cutoff_slot remain in LiveChain while corresponding block/state may be deleted, creating internal inconsistency.

  3. Medium: pruning is heavy and runs synchronously once per slot before proposing.
    self.store.safety_net_prune() runs at interval 0 (crates/blockchain/src/lib.rs:171) and does multiple full-table scans, including two scans over BlockHeaders (crates/storage/src/store.rs:1032, crates/storage/src/store.rs:1068).
    On large DBs this can delay slot duties (proposal/attestation timing), which is a liveness/perf risk.

  4. Low: panic-on-decode in pruning path can crash node on malformed DB entries.
    H256::from_ssz_bytes(&key_bytes).expect("valid root") appears in pruning loops (crates/storage/src/store.rs:1043, crates/storage/src/store.rs:1078).
    Even if corruption is rare, this makes a maintenance operation a hard crash path; prefer fallible handling with warning + skip.

Overall direction (adding an OOM safety net) is reasonable, but the current LiveChain pruning semantics look consensus-risky unless bounded by justified/fork-choice requirements and made boundary-consistent.


Automated review by OpenAI Codex · custom prompt


github-actions bot commented Mar 2, 2026

🤖 Claude Code Review



PR #175 Review: Safety-net pruning to prevent OOM

The PR addresses a real and critical problem — unbounded state accumulation causing OOM. The overall approach (cutoff-based pruning with protected roots) is sound. However, there are several correctness and robustness issues worth addressing.


Correctness Issues

Off-by-one inconsistency between pruning methods

prune_states and prune_old_blocks delete entries where header.slot <= cutoff_slot (inclusive), but prune_live_chain — called with the same cutoff_slot — uses slot < cutoff_slot (exclusive), because it was designed around finalized_slot semantics where the boundary slot is retained.

Result: at exactly cutoff_slot, the LiveChain entries survive while the corresponding BlockHeaders and States are deleted. Fork choice traversal through LiveChain could encounter these zombie entries pointing to blocks/states that no longer exist.

Fix: either change prune_states/prune_old_blocks to use < cutoff_slot (keeping cutoff_slot data), or call prune_live_chain(cutoff_slot + 1) to make it delete up to and including cutoff_slot. Given the intent is to retain 1024 slots, < cutoff_slot is semantically cleaner.
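
Both options can be stated as predicates; this sketch (names are illustrative) checks that option B's shifted bound deletes exactly the same slots as option A would with an inclusive `<= cutoff`:

```rust
/// Option A: use `<` everywhere, so slot == cutoff survives in every table.
fn delete_a(slot: u64, cutoff: u64) -> bool {
    slot < cutoff
}

/// Option B: keep `<=` in block/state pruning...
fn delete_blocks_b(slot: u64, cutoff: u64) -> bool {
    slot <= cutoff
}

/// ...and call the `<`-based live-chain pruning with `cutoff + 1`,
/// so both paths delete through `cutoff` with no dangling entries.
fn delete_live_chain_b(slot: u64, cutoff: u64) -> bool {
    slot < cutoff + 1
}
```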

Non-atomic state and block pruning

prune_states commits its write batch, then prune_old_blocks opens a new write batch. If the process crashes between those two commits, the node will have States deleted for blocks that still exist in BlockHeaders. On restart, fork choice would call get_state(block_root) for those blocks and get None, which would likely panic or cause hard errors.

These two should share a single write batch committed together, ideally merged with the single scan noted below.


Performance: Duplicated Full Table Scan

prune_states and prune_old_blocks both do a full iteration over all of BlockHeaders with identical filtering logic (slot <= cutoff_slot && !protected_roots.contains(&root)). This is two separate read transactions and two full scans of what could be a large table.

They should be merged into a single scan that collects keys once and dispatches to a single write batch covering all four tables (States, BlockHeaders, BlockBodies, BlockSignatures). This would also fix the non-atomic issue above.

// Conceptual merge:
fn prune_states_and_blocks(&mut self, cutoff_slot: u64, protected: &HashSet<H256>) -> usize {
    let view = self.backend.begin_read().expect("read view");
    let mut keys_to_delete = vec![];
    for (key_bytes, value_bytes) in view.prefix_iterator(Table::BlockHeaders, &[])
        .expect("iterator").filter_map(|r| r.ok())
    {
        let Some(header) = BlockHeader::from_ssz_bytes(&value_bytes).ok() else { continue; };
        if header.slot < cutoff_slot {  // use < for consistency with prune_live_chain
            match H256::from_ssz_bytes(&key_bytes) {
                Ok(root) if !protected.contains(&root) => keys_to_delete.push(key_bytes.to_vec()),
                _ => continue,
            }
        }
    }
    drop(view);
    let count = keys_to_delete.len();
    if count > 0 {
        let mut batch = self.backend.begin_write().expect("write batch");
        batch.delete_batch(Table::States, keys_to_delete.clone()).expect("delete states");
        batch.delete_batch(Table::BlockHeaders, keys_to_delete.clone()).expect("delete headers");
        batch.delete_batch(Table::BlockBodies, keys_to_delete.clone()).expect("delete bodies");
        batch.delete_batch(Table::BlockSignatures, keys_to_delete).expect("delete sigs");
        batch.commit().expect("commit");
    }
    count
}

Robustness: Panic on Malformed DB Key

In both prune_states and prune_old_blocks:

let root = H256::from_ssz_bytes(&key_bytes).expect("valid root");

If any BlockHeaders key is not exactly 32 bytes (DB corruption, schema change, etc.), this panics the process. The header decode just above it correctly uses .ok() to skip errors, but the key decode does not. This should be changed to:

let Ok(root) = H256::from_ssz_bytes(&key_bytes) else {
    warn!("Skipping malformed key in BlockHeaders during safety-net pruning");
    continue;
};

Minor: Incomplete Log Condition

The log trigger condition:

if pruned_states > 0 || pruned_blocks > 0 || pruned_chain > 0 {
    info!(..., pruned_sigs, pruned_att_data, "Safety-net pruning: finalization stalled");
}

pruned_sigs and pruned_att_data are included in the log fields but not in the trigger condition. If only signatures or attestation data are pruned, no log is emitted. Also, prune_aggregated_payload_table returns nothing and contributes no count to the log. For observability, the condition should include all tables or at minimum be documented as intentional.
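
A minimal sketch of the suggested fix: derive the trigger from every per-table count so partial pruning is never silent (the counts order is illustrative, not the PR's field layout):

```rust
/// Emit the summary log iff any table pruned at least one entry.
fn should_log(pruned_counts: &[usize]) -> bool {
    pruned_counts.iter().any(|&c| c > 0)
}
```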


What Looks Good

  • The cutoff formula finalized_slot.max(head_slot.saturating_sub(MAX_UNFINALIZED_SLOTS)) is correct — no underflow, correct no-op behavior when finalization is healthy.
  • The early-return if cutoff_slot <= finalized_slot correctly avoids redundant work in the healthy case.
  • Protected-roots set (head, latest_finalized, latest_justified, safe_target) correctly prevents pruning data needed for active consensus.
  • Placement in on_tick at interval 0, before block proposal, is safe — the head is protected.
  • States confirmed to be keyed by block root (not state root), so the key reuse between BlockHeaders and States in prune_states is valid.
  • Three-table block deletion in prune_old_blocks is atomic (single batch, one commit).

Summary of Action Items

| Priority | Item |
| --- | --- |
| High | Merge prune_states + prune_old_blocks into a single scan + single atomic write batch |
| High | Fix off-by-one: align prune_live_chain boundary with prune_states/prune_old_blocks (use < throughout, or pass cutoff_slot + 1 to prune_live_chain) |
| Medium | Replace .expect("valid root") on key decode with graceful continue + warn! |
| Low | Include all tables in log trigger condition, or add prune_aggregated_payload_table count to log |

Automated review by Claude (Anthropic) · custom prompt

Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR implements safety-net pruning to prevent OOM when finalization is stalled by limiting unfinalized data retention to a 1024-slot window (~68 minutes). The implementation adds pruning for States, BlockHeaders, BlockBodies, and BlockSignatures—tables that previously had no pruning mechanism at all.

Key changes:

  • Computes cutoff_slot = max(finalized_slot, head_slot - 1024) to determine pruning threshold
  • Protects 4 critical roots: head, latest_finalized, latest_justified, and safe_target
  • Prunes once per slot at interval 0 in BlockChainServer::on_tick
  • No-op when finalization is healthy (cutoff equals finalized_slot)

Critical issue identified:
prune_live_chain doesn't respect protected roots, which could break fork choice if justification stalls for >1024 slots. Fork choice starts from the justified checkpoint and requires its LiveChain entry to look up parent relationships. Without this entry, fork choice fails to compute the head.

Minor issue:
Inconsistent error handling between prune_states (warns on decode failure) and prune_old_blocks (silent).

Confidence Score: 3/5

  • Safe to merge with moderate risk - solves OOM for common finalization stalls, but has edge case bug
  • The implementation correctly handles the common case (finalization stalled, justification advancing) and includes proper protected root checks for block/state pruning. However, LiveChain pruning doesn't respect protected roots, creating a bug when both finalization and justification stall for >1024 slots. This edge case would break fork choice completely. Given the severity but low likelihood of the edge case, score is 3/5.
  • Pay close attention to crates/storage/src/store.rs - specifically the prune_live_chain call on line 1003 needs protection for justified/safe checkpoints

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/storage/src/store.rs | Adds safety-net pruning methods (safety_net_prune, prune_states, prune_old_blocks) to prevent OOM when finalization stalls. Potential issue: prune_live_chain doesn't respect protected roots, which could break fork choice if justified_slot < cutoff_slot. |
| crates/blockchain/src/lib.rs | Calls safety_net_prune() once per slot at interval 0 to prevent OOM during stalled finalization. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[BlockChainServer::on_tick<br/>interval=0] --> B[safety_net_prune]
    B --> C{Calculate cutoff_slot<br/>max finalized_slot,<br/>head_slot - 1024}
    C --> D{cutoff_slot <= finalized_slot?}
    D -->|Yes| E[Return early<br/>Finalization healthy]
    D -->|No| F[Build protected_roots set<br/>head, finalized, justified, safe]
    F --> G[prune_states cutoff_slot, protected_roots]
    F --> H[prune_old_blocks cutoff_slot, protected_roots]
    F --> I[prune_live_chain cutoff_slot]
    F --> J[prune_gossip_signatures cutoff_slot]
    F --> K[prune_attestation_data cutoff_slot]
    F --> L[prune_aggregated_payloads cutoff_slot]
    G --> M[Iterate BlockHeaders<br/>Find slot <= cutoff_slot<br/>Not in protected_roots]
    M --> N[Delete from States table]
    H --> O[Iterate BlockHeaders<br/>Find slot <= cutoff_slot<br/>Not in protected_roots]
    O --> P[Delete from BlockHeaders<br/>BlockBodies, BlockSignatures]
    I --> Q[Delete LiveChain entries<br/>slot < cutoff_slot<br/>⚠️ No protected_roots check]
    N --> R[Log pruned counts]
    P --> R
    Q --> R
    J --> R
    K --> R
    L --> R

Last reviewed commit: be95754


@greptile-apps greptile-apps bot left a comment


2 files reviewed, 2 comments



let pruned_states = self.prune_states(cutoff_slot, &protected_roots);
let pruned_blocks = self.prune_old_blocks(cutoff_slot, &protected_roots);
let pruned_chain = self.prune_live_chain(cutoff_slot);

prune_live_chain doesn't respect protected_roots, which could break fork choice if justified_slot < cutoff_slot.

When justified_slot < cutoff_slot (e.g., justified at slot 500, cutoff at 976), the LiveChain entry for the justified checkpoint gets pruned. Fork choice starts from the justified root and requires its LiveChain entry to look up (slot, parent_root) (see fork_choice/src/lib.rs:52-54). Without this entry, blocks.get(&start_root) returns None, causing fork choice to fail.

This scenario occurs when both finalization and justification are stalled for >1024 slots (~68 minutes). While less common than finalization-only stalls, it's possible with severe network issues or insufficient validator participation.

Protected roots should either:

  1. Be checked in prune_live_chain (skip pruning LiveChain entries for protected roots), or
  2. Use a separate cutoff that ensures justified/safe always remain in LiveChain (e.g., max(finalized_slot, justified_slot.saturating_sub(1024)))
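
Option 2 as a sketch, with the formula taken verbatim from the comment (function name hypothetical):

```rust
/// Live-chain cutoff that always keeps the justified checkpoint's entry:
/// the retention window is anchored at the justified slot instead of the head,
/// so the cutoff can never pass the justified checkpoint.
fn live_chain_cutoff(finalized_slot: u64, justified_slot: u64) -> u64 {
    finalized_slot.max(justified_slot.saturating_sub(1024))
}
```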

Comment on lines +1073 to +1075
let Some(header) = BlockHeader::from_ssz_bytes(&value_bytes).ok() else {
continue;
};

Silently continues on decode failure, inconsistent with prune_states:1038 which warns. Add logging for consistency:

Suggested change
let Some(header) = BlockHeader::from_ssz_bytes(&value_bytes).ok() else {
continue;
};
let Some(header) = BlockHeader::from_ssz_bytes(&value_bytes).ok() else {
warn!("Failed to decode block header during safety-net pruning");
continue;
};

@MegaRedHand
Collaborator

The safety net should only prune the LiveChain, and I think we should review other solutions before choosing this one

@MegaRedHand MegaRedHand closed this Mar 2, 2026
@MegaRedHand MegaRedHand deleted the safety-net-pruning branch March 2, 2026 20:24

Development

Successfully merging this pull request may close these issues.

Prune finalized states and old blocks
