Add safety-net pruning to prevent OOM when finalization is stalled #175

pablodeymo wants to merge 1 commit into main
Conversation
When the chain runs without finalization (e.g., insufficient aggregators), all pruning is disabled since every prune function gates on finalized_slot advancing. The States table has no pruning at all, and each state is 100+ MB. After ~12 hours without finalization this can reach terabytes of data. Add a safety-net that computes cutoff_slot = max(finalized_slot, head_slot - 1024) and prunes states, blocks, live chain, signatures, attestation data, and aggregated payloads older than the cutoff. Protected roots (head, finalized, justified, safe_target) are never pruned. When finalization is healthy, cutoff equals finalized_slot and this is a no-op. Runs once per slot at interval 0 in on_tick, after tick processing but before block proposal.
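The cutoff rule described above can be sketched as follows. This is a minimal illustration; the constant and function names are placeholders, not the PR's actual API:

```rust
// Sketch of the safety-net cutoff rule: prune everything older than
// max(finalized_slot, head_slot - 1024). Names are illustrative.
const SAFETY_WINDOW: u64 = 1024;

fn cutoff_slot(finalized_slot: u64, head_slot: u64) -> u64 {
    // saturating_sub avoids underflow early in the chain's life
    finalized_slot.max(head_slot.saturating_sub(SAFETY_WINDOW))
}

fn main() {
    // Healthy finalization: cutoff equals finalized_slot, so pruning is a no-op
    assert_eq!(cutoff_slot(5000, 5100), 5000);
    // Stalled finalization: cutoff trails the head by exactly the window
    assert_eq!(cutoff_slot(100, 5100), 4076);
    // Young chain: window larger than head, cutoff stays at finalized_slot
    assert_eq!(cutoff_slot(0, 500), 0);
}
```

When finalization is healthy the first branch dominates and nothing beyond the existing finalization-triggered pruning happens, which matches the "no-op" behavior described above.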
🤖 Kimi Code Review

Review Summary

The safety-net pruning implementation is a solid defensive mechanism against OOM when finalization stalls. However, several issues need attention:

Critical Issues
Security & Consensus Issues
Performance & Implementation Issues
Minor Issues
Suggested Fixes

```rust
// 1. Use a single transaction for atomicity
let mut batch = self.backend.begin_write().expect("write batch");
let view = batch.read_view().expect("read view");

// 2. Handle potential decode errors gracefully
let root = match H256::from_ssz_bytes(&key_bytes) {
    Ok(r) => r,
    Err(_) => {
        warn!("Invalid root bytes in database");
        continue;
    }
};

// 3. Implement missing pruning functions
fn prune_live_chain(&mut self, cutoff_slot: u64) -> usize {
    // Implementation needed
    0
}
```

The core concept is sound, but these issues need resolution before merging.

Automated review by Kimi (Moonshot AI) · custom prompt
🤖 Codex Code Review

Overall direction (adding an OOM safety net) is reasonable, but the current

Automated review by OpenAI Codex · custom prompt
🤖 Claude Code Review

PR #175 Review: Safety-net pruning to prevent OOM

The PR addresses a real and critical problem — unbounded state accumulation causing OOM. The overall approach (cutoff-based pruning with protected roots) is sound. However, there are several correctness and robustness issues worth addressing.

Correctness Issues

Off-by-one inconsistency between pruning methods
Result: at exactly `cutoff_slot`, `prune_states`/`prune_old_blocks` delete the entry (`slot <= cutoff_slot`) while `prune_live_chain` keeps it (`slot < cutoff_slot`). Fix: either change the state/block methods to use `<` or change `prune_live_chain` to use `<=`, so every method uses the same boundary.

Non-atomic state and block pruning

These two should share a single write batch committed together, ideally merged with the single scan noted below.

Performance: Duplicated Full Table Scan
They should be merged into a single scan that collects keys once and dispatches to a single write batch covering all four tables (`States`, `BlockHeaders`, `BlockBodies`, `BlockSignatures`):

```rust
// Conceptual merge:
fn prune_states_and_blocks(&mut self, cutoff_slot: u64, protected: &HashSet<H256>) -> usize {
    let view = self.backend.begin_read().expect("read view");
    let mut keys_to_delete = vec![];
    for (key_bytes, value_bytes) in view.prefix_iterator(Table::BlockHeaders, &[])
        .expect("iterator").filter_map(|r| r.ok())
    {
        let Some(header) = BlockHeader::from_ssz_bytes(&value_bytes).ok() else { continue; };
        if header.slot < cutoff_slot { // use < for consistency with prune_live_chain
            match H256::from_ssz_bytes(&key_bytes) {
                Ok(root) if !protected.contains(&root) => keys_to_delete.push(key_bytes.to_vec()),
                _ => continue,
            }
        }
    }
    drop(view);
    let count = keys_to_delete.len();
    if count > 0 {
        let mut batch = self.backend.begin_write().expect("write batch");
        batch.delete_batch(Table::States, keys_to_delete.clone()).expect("delete states");
        batch.delete_batch(Table::BlockHeaders, keys_to_delete.clone()).expect("delete headers");
        batch.delete_batch(Table::BlockBodies, keys_to_delete.clone()).expect("delete bodies");
        batch.delete_batch(Table::BlockSignatures, keys_to_delete).expect("delete sigs");
        batch.commit().expect("commit");
    }
    count
}
```

Robustness: Panic on Malformed DB Key

In both pruning methods:

```rust
let root = H256::from_ssz_bytes(&key_bytes).expect("valid root");
```

If any key fails to decode, this panics. Prefer skipping the entry:

```rust
let Ok(root) = H256::from_ssz_bytes(&key_bytes) else {
    warn!("Skipping malformed key in BlockHeaders during safety-net pruning");
    continue;
};
```

Minor: Incomplete Log Condition

The log trigger condition:

```rust
if pruned_states > 0 || pruned_blocks > 0 || pruned_chain > 0 {
    info!(..., pruned_sigs, pruned_att_data, "Safety-net pruning: finalization stalled");
}
```

omits the signature, attestation-data, and payload counters, so a slot where only those tables were pruned is never logged.
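A gate that sums every counter would also report signature- and attestation-only prunes. A sketch; the counter names follow the review snippet and are not the PR's exact code:

```rust
// Sketch: gate the summary log on the total across all counters so that
// signature- or attestation-only prunes are still reported.
fn should_log(counts: &[usize]) -> bool {
    counts.iter().sum::<usize>() > 0
}

fn main() {
    // Only signatures and attestation data were pruned: a condition that
    // checks only states/blocks/chain would skip the log; summing does not.
    assert!(should_log(&[0, 0, 0, 7, 3, 0]));
    assert!(!should_log(&[0, 0, 0, 0, 0, 0]));
}
```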
What Looks Good
Summary of Action Items
Automated review by Claude (Anthropic) · custom prompt
Greptile SummaryThis PR implements safety-net pruning to prevent OOM when finalization is stalled by limiting unfinalized data retention to a 1024-slot window (~68 minutes). The implementation adds pruning for States, BlockHeaders, BlockBodies, and BlockSignatures—tables that previously had no pruning mechanism at all. Key changes:
Critical issue identified: `prune_live_chain` doesn't respect protected roots (detailed in the inline comment below). Minor issue: silent continue on decode failure in `prune_old_blocks`.

Confidence Score: 3/5
| Filename | Overview |
|---|---|
| crates/storage/src/store.rs | Adds safety-net pruning methods (safety_net_prune, prune_states, prune_old_blocks) to prevent OOM when finalization stalls. Potential issue: prune_live_chain doesn't respect protected roots, which could break fork choice if justified_slot < cutoff_slot. |
| crates/blockchain/src/lib.rs | Calls safety_net_prune() once per slot at interval 0 to prevent OOM during stalled finalization. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[BlockChainServer::on_tick<br/>interval=0] --> B[safety_net_prune]
    B --> C{Calculate cutoff_slot<br/>max finalized_slot,<br/>head_slot - 1024}
    C --> D{cutoff_slot <= finalized_slot?}
    D -->|Yes| E[Return early<br/>Finalization healthy]
    D -->|No| F[Build protected_roots set<br/>head, finalized, justified, safe]
    F --> G[prune_states cutoff_slot, protected_roots]
    F --> H[prune_old_blocks cutoff_slot, protected_roots]
    F --> I[prune_live_chain cutoff_slot]
    F --> J[prune_gossip_signatures cutoff_slot]
    F --> K[prune_attestation_data cutoff_slot]
    F --> L[prune_aggregated_payloads cutoff_slot]
    G --> M[Iterate BlockHeaders<br/>Find slot <= cutoff_slot<br/>Not in protected_roots]
    M --> N[Delete from States table]
    H --> O[Iterate BlockHeaders<br/>Find slot <= cutoff_slot<br/>Not in protected_roots]
    O --> P[Delete from BlockHeaders<br/>BlockBodies, BlockSignatures]
    I --> Q[Delete LiveChain entries<br/>slot < cutoff_slot<br/>⚠️ No protected_roots check]
    N --> R[Log pruned counts]
    P --> R
    Q --> R
    J --> R
    K --> R
    L --> R
```
Last reviewed commit: be95754
```rust
let pruned_states = self.prune_states(cutoff_slot, &protected_roots);
let pruned_blocks = self.prune_old_blocks(cutoff_slot, &protected_roots);
let pruned_chain = self.prune_live_chain(cutoff_slot);
```
prune_live_chain doesn't respect protected_roots, which could break fork choice if justified_slot < cutoff_slot.
When justified_slot < cutoff_slot (e.g., justified at slot 500, cutoff at 976), the LiveChain entry for the justified checkpoint gets pruned. Fork choice starts from the justified root and requires its LiveChain entry to look up (slot, parent_root) (see fork_choice/src/lib.rs:52-54). Without this entry, blocks.get(&start_root) returns None, causing fork choice to fail.
This scenario occurs when both finalization and justification are stalled for >1024 slots (~68 minutes). While less common than finalization-only stalls, it's possible with severe network issues or insufficient validator participation.
Protected roots should either:
- Be checked in `prune_live_chain` (skip pruning LiveChain entries for protected roots), or
- Use a separate cutoff that ensures justified/safe always remain in LiveChain (e.g., `max(finalized_slot, justified_slot.saturating_sub(1024))`)
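The second option can be sketched as follows; the constant and function names are illustrative, not code from the PR:

```rust
// Sketch of the second option: a LiveChain cutoff that can never prune
// the justified checkpoint, per the formula suggested in the comment.
const SAFETY_WINDOW: u64 = 1024;

fn live_chain_cutoff(finalized_slot: u64, justified_slot: u64) -> u64 {
    finalized_slot.max(justified_slot.saturating_sub(SAFETY_WINDOW))
}

fn main() {
    // Justified at 500, finalized at 100: cutoff stays at 100, so the
    // justified checkpoint's LiveChain entry (slot 500) survives.
    assert_eq!(live_chain_cutoff(100, 500), 100);
    // Justified far ahead: the window trails it, entry still survives.
    assert_eq!(live_chain_cutoff(0, 2000), 976);
}
```

Because the cutoff is anchored to the justified slot rather than the head, fork choice can always look up its starting root even during a long justification stall.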
```rust
let Some(header) = BlockHeader::from_ssz_bytes(&value_bytes).ok() else {
    continue;
};
```
Silently continues on decode failure, inconsistent with prune_states:1038 which warns. Add logging for consistency:
Suggested:

```rust
let Some(header) = BlockHeader::from_ssz_bytes(&value_bytes).ok() else {
    warn!("Failed to decode block header during safety-net pruning");
    continue;
};
```
The safety net should only prune the
Motivation
When the chain runs for an extended period without finalization (e.g., due to insufficient aggregators or network issues), all pruning is effectively disabled — every prune function gates on `finalized_slot` advancing. The `States` table has no pruning at all, and each state is 100+ MB (contains `historical_block_hashes` up to 8 MiB, `justifications_roots` up to 8 MiB, validators ~245 KB).

After ~12 hours without finalization: 10,800 states × 100+ MB = potential terabytes of data, causing OOM.
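The back-of-envelope numbers above are consistent with 4-second slots (which also matches the 1024-slot ≈ 68-minute window quoted in this PR); a quick arithmetic check, with the slot duration as an assumption:

```rust
// Sanity-check the growth estimate, assuming 4-second slots.
fn slots_in(hours: u64, slot_secs: u64) -> u64 {
    hours * 3600 / slot_secs
}

fn main() {
    let slot_secs = 4; // assumption, implied by 1024 slots ≈ 68 minutes
    // Safety window: 1024 slots * 4 s = 4096 s ≈ 68 minutes
    assert_eq!(1024 * slot_secs / 60, 68);
    // ~12 hours without finalization, one state per slot
    let states = slots_in(12, slot_secs);
    assert_eq!(states, 10_800);
    // 10,800 states * 100 MB each ≈ 1.08 TB
    assert_eq!(states as u128 * 100_000_000, 1_080_000_000_000);
}
```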
Closes #166
Relates to #103
Description
Adds a safety-net pruning mechanism that activates only when finalization is stalled, using a 1024-slot window (~68 minutes of chain history).
Cutoff calculation
- `cutoff_slot == finalized_slot` → no-op (existing finalization-triggered pruning already handles it)
- `cutoff_slot == head_slot - 1024` → prunes old unfinalized data

Protected roots (never pruned)
- `head` root
- `latest_finalized` root
- `latest_justified` root
- `safe_target` root

What gets pruned
All prunable tables, using `cutoff_slot` instead of `finalized_slot`.

When it runs
Once per slot at interval 0 in `BlockChainServer::on_tick`, after tick processing (attestation acceptance) but before block proposal.

New methods in `Store`

- `safety_net_prune()` — Public entry point. Computes cutoff, builds protected set, calls individual prune methods, logs summary at info level.
- `prune_states(cutoff_slot, protected_roots)` — Iterates `BlockHeaders` (small, ~100 bytes each) to find slots, deletes matching `States` entries.
- `prune_old_blocks(cutoff_slot, protected_roots)` — Deletes from `BlockHeaders`, `BlockBodies`, and `BlockSignatures` for non-protected old blocks.

Files changed
- `crates/storage/src/store.rs`: + `MAX_UNFINALIZED_SLOTS` constant, + `safety_net_prune()`, + `prune_states()`, + `prune_old_blocks()`
- `crates/blockchain/src/lib.rs`: calls `self.store.safety_net_prune()` at interval 0 in `on_tick`

How to test
Safety-net pruning only activates when `head_slot - finalized_slot > 1024`, which doesn't occur in any test fixture. All existing tests pass unchanged.

To test the actual pruning behavior in a live environment:

- Run without the `--is-aggregator` flag (finalization will stall)
- Look for `Safety-net pruning: finalization stalled` log messages with pruned counts
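The activation condition itself can also be exercised as a plain unit check; a sketch, where `is_stalled` is a hypothetical helper rather than a method from the PR:

```rust
// Sketch: the safety net only acts when the head has outrun finalization
// by more than the 1024-slot window. `is_stalled` is hypothetical.
fn is_stalled(head_slot: u64, finalized_slot: u64) -> bool {
    head_slot.saturating_sub(finalized_slot) > 1024
}

fn main() {
    assert!(!is_stalled(1024, 0));  // exactly at the window: still healthy
    assert!(is_stalled(1025, 0));   // one past the window: safety net activates
    assert!(!is_stalled(500, 400)); // normal operation
}
```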