
Conversation

@petarjuki7
Member

Issue Addressed

Addresses issue #722

Proposed Changes

Added error categorisation and automatic state recovery for critical database errors during event processing.

  • Refactored error types to use thiserror for cleaner error handling
  • Implemented is_critical() method to distinguish system faults (database/sync errors) from recoverable errors (malformed events)
  • Added reload_state_from_disk() to restore in-memory state after transaction rollback
  • Critical errors now trigger transaction rollback and state reload to maintain consistency
  • Recoverable errors are logged and skipped without affecting the batch
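
A minimal sketch of that categorisation, using hypothetical variant names (the real enum uses thiserror and lives in the eth crate; this is illustrative only):

```rust
use std::fmt;

// Hypothetical stand-in for the PR's error enum; real variant names may differ.
#[derive(Debug)]
enum ExecutionError {
    Database(String),       // system fault: triggers rollback + state reload
    MalformedEvent(String), // bad input: logged and skipped
}

impl ExecutionError {
    // Critical errors indicate a system fault rather than bad input.
    fn is_critical(&self) -> bool {
        matches!(self, ExecutionError::Database(_))
    }
}

impl fmt::Display for ExecutionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExecutionError::Database(e) => write!(f, "database error: {e}"),
            ExecutionError::MalformedEvent(e) => write!(f, "malformed event: {e}"),
        }
    }
}

fn main() {
    assert!(ExecutionError::Database("disk I/O".into()).is_critical());
    assert!(!ExecutionError::MalformedEvent("bad topic".into()).is_critical());
}
```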

Additional Info

The sync loop automatically retries from last_processed_block after critical errors, so retry logic operates at the batch level rather than per-event.

@petarjuki7 petarjuki7 self-assigned this Nov 10, 2025

@petarjuki7 petarjuki7 added ready-for-review This PR is ready to be reviewed v1.2.0 labels Nov 10, 2025
@mergify

mergify bot commented Nov 10, 2025

This pull request has merge conflicts. Could you please resolve them @petarjuki7? 🙏

@mergify mergify bot added waiting-on-author and removed ready-for-review This PR is ready to be reviewed labels Nov 10, 2025
@petarjuki7 petarjuki7 force-pushed the handle_internal_errors branch from dac87ea to f92eefd on November 10, 2025 03:36
@petarjuki7 petarjuki7 added ready-for-review This PR is ready to be reviewed and removed waiting-on-author labels Nov 10, 2025
Member

@dknopik dknopik left a comment


Nice!

Comment on lines +134 to +138
// Critical error - rollback transaction and reload state from disk
warn!(tx_hash, "Critical error during event processing: {e}");
tx.rollback()?;
self.db.reload_state_from_disk()?;
return Err(e);

Not sure if this is a good location for this. If e.g. the processed_block or tx.commit() below errors, it is not caught.

Comment on lines +15 to +19
// manual Display impl removed by this PR:
impl Display for DatabaseError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{self:?}")
    }
}

// thiserror variants added:
#[error("SQL error")]
SQLError(#[from] rusqlite::Error),

#[error("connection pool error")]
SQLPoolError(#[from] r2d2::Error),
add {0} to these format strings

Comment on lines +27 to +28
#[error("database error")]
Database(#[from] DatabaseError),

add {0}

@dknopik dknopik added waiting-on-author and removed ready-for-review This PR is ready to be reviewed labels Nov 12, 2025
@diegomrsantos
Member

  • Added reload_state_from_disk() to restore in-memory state after transaction rollback
  • Critical errors now trigger transaction rollback and state reload to maintain consistency

I'm not sure this is a good idea, @claude what do you think?

@claude-code-actions-sigp

claude-code-actions-sigp bot commented Nov 12, 2025

Claude finished @diegomrsantos's task (View job)


Analysis: State Recovery After Critical Errors

After reviewing the implementation, I have significant concerns about the approach of reloading state from disk after transaction rollback. Here are the key issues:


Critical Problems

1. Incomplete Error Coverage (Already noted by @dknopik)

The error handling at event_processor.rs:133-138 occurs inside the event loop but before critical operations:

if e.is_critical() {
    warn!(tx_hash, "Critical error during event processing: {e}");
    tx.rollback()?;
    self.db.reload_state_from_disk()?;
    return Err(e);
}

Problem: This doesn't catch errors from:

  • self.db.processed_block(end_block, &tx)? (line 151)
  • tx.commit()? (line 154)

If either of these fail, the in-memory state remains inconsistent with disk because reload_state_from_disk() never executes.

2. Race Condition: State Inconsistency Window

The reload happens after the transaction is rolled back but before returning the error:

tx.rollback()?;                        // Disk state reverted
self.db.reload_state_from_disk()?;     // Memory reloaded
return Err(e);                         // Error propagated

Problem: Between rollback and reload, other threads could:

  • Read the stale in-memory state (via db.state())
  • Make decisions based on inconsistent data
  • The watch::Sender used for state updates is modified outside the transaction

3. Sync Loop Retry Logic Doesn't Help

The PR description states:

"The sync loop automatically retries from last_processed_block after critical errors"

Problem: Looking at sync.rs:487-496, the sync loop calls process_logs() and does not catch the error to retry:

spawn_blocking(move || -> Result<(), ExecutionError> {
    event_processor.process_logs(batch.logs, false, batch.end_block)?;
    // ...
}));

The error propagates up and triggers full reconnection logic (lines 236-255), but there's no guarantee the state is consistent before retry.

4. State Reload Implementation Issues

The reload_state_from_disk() implementation (database/lib.rs:263-269):

pub fn reload_state_from_disk(&self) -> Result<(), DatabaseError> {
    let new_state = NetworkState::new_with_state(&self.conn_pool, &self.operator)?;
    self.state.send_replace(new_state);
    Ok(())
}

Problems:

  • Uses send_replace() which notifies all watchers immediately
  • Other components may be in the middle of operations expecting the old state
  • No coordination with concurrent database operations
  • Could cause cascading failures in components that assume state monotonicity

5. Violated Invariants

The approach violates key distributed systems principles:

  1. Atomicity: State changes should be atomic with their persistence
  2. Isolation: The reload affects all components simultaneously without coordination
  3. Consistency: There's a window where state is inconsistent
  4. Durability: The in-memory state can diverge if reload fails

Missing Format Strings

As @dknopik noted, these errors are missing {0}:

  • database/error.rs:12 - IOError
  • database/error.rs:15 - SQLError
  • database/error.rs:18 - SQLPoolError
  • eth/error.rs:28 - Database variant

Recommended Alternatives

Rather than reloading state after errors, consider:

  1. Move error handling outside the transaction:

    let result = (|| -> Result<(), ExecutionError> {
        for log in logs.iter() {
            // process events
        }
        self.db.processed_block(end_block, &tx)?;
        Ok(())
    })();
    
    match result {
        // commit only once the closure has released its borrow of `tx`
        Ok(()) => tx.commit()?,
        Err(e) => {
            if e.is_critical() {
                tx.rollback()?;
                self.db.reload_state_from_disk()?;
            }
            return Err(e);
        }
    }
  2. Prevent state/DB divergence instead of fixing it:

    • Only update in-memory state after successful commit
    • Use a write-ahead approach where state lags slightly behind validation
    • Defer state modifications until transaction completes
  3. Add state versioning:

    • Track which transaction version each in-memory state corresponds to
    • Detect when state is stale and needs refresh
    • Allow components to gracefully handle state reloads
  4. Simplify by accepting inconsistency:

    • Let the process crash on critical database errors
    • Rely on process restart to restore consistency
    • This is simpler and more reliable than trying to recover in-process
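
The state-versioning idea (point 3 above) could be sketched as a toy like the following; every name here is hypothetical, not the project's actual types:

```rust
// Hypothetical versioned snapshot: each in-memory state carries the DB
// version it was built from, so consumers can detect staleness.
#[derive(Clone, Debug)]
struct VersionedState {
    version: u64,    // last committed transaction/block version
    last_block: u64, // illustrative payload
}

struct StateCell {
    current: VersionedState,
}

impl StateCell {
    // Rebuild the snapshot from what is actually on disk.
    fn reload(&mut self, disk_version: u64, disk_block: u64) {
        self.current = VersionedState { version: disk_version, last_block: disk_block };
    }

    // A snapshot is stale if its version no longer matches the disk's.
    fn is_stale(&self, disk_version: u64) -> bool {
        self.current.version != disk_version
    }
}

fn main() {
    let mut cell = StateCell { current: VersionedState { version: 7, last_block: 105 } };
    // A rollback reverted the disk to version 6: the snapshot is now stale.
    assert!(cell.is_stale(6));
    cell.reload(6, 100);
    assert!(!cell.is_stale(6));
    assert_eq!(cell.current.last_block, 100);
}
```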

Verdict

I agree with @diegomrsantos's concern. The current approach:

  • ✗ Has incomplete error coverage
  • ✗ Creates race conditions
  • ✗ Violates atomicity guarantees
  • ✗ Could cause cascading failures
  • ✗ Adds complexity without clear benefit

The safest approach for critical database errors is to crash and restart rather than attempt in-process recovery. This ensures clean state and is easier to reason about.


@dknopik
Member

dknopik commented Nov 13, 2025

I'm sure this is a good idea, @claude what do you think?

@claude-code-actions-sigp

Claude Code is working…

I'll analyze this and get back to you.

View job run

@diegomrsantos
Member

We have a constant issue with claude setup getting stuck, like above.

Trying again to test how claude is influenced by the prompt: I'm sure this is a good idea, @claude what do you think?

@claude-code-actions-sigp

claude-code-actions-sigp bot commented Nov 13, 2025

Claude finished @diegomrsantos's task (View job)


Technical Review: State Recovery on Internal Errors

  • Read modified files to understand implementation
  • Analyze error handling and state recovery mechanism
  • Evaluate architectural concerns
  • Provide comprehensive technical review

Summary

After reviewing the implementation, I have significant concerns about the approach. While the intention to handle critical errors is good, the implementation has fundamental flaws that could lead to data inconsistency and cascading failures.


Critical Issues

1. Incomplete Error Coverage (Confirmed from @dknopik's review)

The critical error handling at event_processor.rs:133-138 occurs inside the event processing loop but before critical database operations:

if e.is_critical() {
    warn!(tx_hash, "Critical error during event processing: {e}");
    tx.rollback()?;                      // Line 136
    self.db.reload_state_from_disk()?;   // Line 137
    return Err(e);                       // Line 138
}

Problem: This doesn't catch errors from:

  • self.db.processed_block(end_block, &tx)? at line 151
  • tx.commit()? at line 154

If either operation fails, the in-memory state remains inconsistent with disk because reload_state_from_disk() never executes.

Location: anchor/eth/src/event_processor.rs:133-154


2. Race Condition: State Inconsistency Window

The state reload implementation (database/lib.rs:263-268) has a critical race condition:

pub fn reload_state_from_disk(&self) -> Result<(), DatabaseError> {
    let new_state = NetworkState::new_with_state(&self.conn_pool, &self.operator)?;
    self.state.send_replace(new_state);
    Ok(())
}

Problems:

  1. Between rollback and reload: Other threads can read stale in-memory state via db.state()
  2. Immediate notification: send_replace() broadcasts the new state to all watchers immediately without coordination
  3. No synchronization: Components may be mid-operation expecting the old state
  4. State monotonicity violation: Components that assume state only advances forward can break

Example scenario:

  1. Thread A rolls back transaction → disk state reverts to block N
  2. Thread B calls db.state() → still sees block N+5 (stale)
  3. Thread B makes decisions based on block N+5 data
  4. Thread A reloads state → now shows block N
  5. Thread B's decisions are now based on data that "never existed"
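
The window can be reproduced with a plain lock-based toy (the project uses tokio::sync::watch, but the timing hazard is the same; everything here is illustrative):

```rust
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    // In-memory state still reflects block 105...
    let state = Arc::new(RwLock::new(105u64));
    // ...while rollback has already reverted the disk to block 100.
    let disk_block = 100u64;

    // Thread B reads between rollback and reload and sees the stale value.
    let reader = {
        let state = Arc::clone(&state);
        thread::spawn(move || *state.read().unwrap())
    };
    let observed = reader.join().unwrap();

    // Thread A only now reloads memory from disk.
    *state.write().unwrap() = disk_block;

    assert_eq!(observed, 105); // decision made on data the disk no longer agrees with
    assert_eq!(*state.read().unwrap(), 100);
}
```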

3. No Retry Logic in Sync Loop

The PR description claims:

"The sync loop automatically retries from last_processed_block after critical errors"

Reality: Looking at sync.rs:236-255, the sync loop catches errors but doesn't implement intelligent retry:

loop {
    match self.try_sync(contract_address, deployment_block).await {
        Ok(_) => unreachable!(),
        Err(e) => {
            error!(?e, "Sync failed, attempting recovery");
            self.is_synced.send_replace(false);
            // Only handles WsError and RpcError specifically
            // All other errors (including Database errors) fall through
        }
    }
}

The sync loop will restart, but there's no guarantee that state is consistent before the retry, and no coordination between the reload and the retry.


4. Violated Distributed Systems Invariants

The approach violates fundamental principles:

  1. Atomicity: State updates should be atomic with persistence

    • Current: In-memory state updates happen throughout transaction, not just at commit
  2. Isolation: State reload affects all components simultaneously

    • Current: No coordination with concurrent operations
  3. Consistency: There's a window where state is inconsistent

    • Current: Between rollback and reload completion
  4. Durability: In-memory state can diverge if reload fails

    • Current: If reload_state_from_disk() fails, we return an error but state is already corrupted

5. Missing Format Strings (As noted by @dknopik)

These error variants are missing {0} placeholders:

  • database/error.rs:12 - IOError should be #[error("IO error: {0:?}")]
  • database/error.rs:15 - SQLError should be #[error("SQL error: {0}")]
  • database/error.rs:18 - SQLPoolError should be #[error("connection pool error: {0}")]
  • eth/error.rs:28 - Database variant needs {0} in format string

Current state: With #[from], thiserror still renders only the format string for Display; the source error is exposed via Error::source() rather than interpolated automatically. Error reporters that print the source chain will surface it, but adding {0} is what makes the cause visible in plain Display output and keeps these variants consistent with the rest of the file.
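
The effect of {0} is easy to see with a hand-rolled Display equivalent to what the derive would generate (names illustrative, std only):

```rust
use std::fmt;

// Hand-rolled equivalent of `#[error("SQL error: {0}")]`; the String is a
// stand-in for rusqlite::Error.
#[derive(Debug)]
enum DatabaseError {
    SQLError(String),
}

impl fmt::Display for DatabaseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // With {0}, the underlying cause is part of the message.
            DatabaseError::SQLError(e) => write!(f, "SQL error: {e}"),
        }
    }
}

fn main() {
    let e = DatabaseError::SQLError("no such table: events".into());
    assert_eq!(e.to_string(), "SQL error: no such table: events");
}
```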


Architectural Concerns

Problem: When Should We Recover vs Crash?

The fundamental question is: Is in-process recovery safer than process restart?

Arguments for crash-and-restart:

  • Clean slate: All state is rebuilt from disk
  • No partial state corruption
  • No race conditions between reload and other operations
  • Simpler to reason about
  • Standard practice for database errors in distributed systems

Arguments for in-process recovery:

  • Faster recovery time
  • No connection drops to beacon node/execution client
  • Maintains P2P network connections

Recommendation: For database errors, crash-and-restart is safer. The cost of a restart is small compared to the risk of subtle state corruption.


Problem: Error Classification

The is_critical() implementation (eth/error.rs:43-50) classifies these as critical:

  • Database(_) - Reasonable
  • DatabaseOperation(_) - Reasonable
  • SyncError(_) - Questionable - this includes channel send failures, not just DB errors

Issue: SyncError is used for the exit processor channel failure (event_processor.rs:660). If the exit processor channel is closed, reloading database state won't help - this is a different class of error.


Recommended Solutions

Option 1: Move Error Handling Outside Transaction (Safest)

let result = (|| -> Result<(), ExecutionError> {
    for (index, log) in logs.iter().enumerate() {
        // process events...
    }
    self.db.processed_block(end_block, &tx)?;
    Ok(())
})();

match result {
    // `tx` is only consumed here, after the closure has released its borrow
    Ok(()) => tx.commit()?,
    Err(e) => {
        if e.is_critical() {
            warn!("Critical error: {e}");
            tx.rollback()?;
            self.db.reload_state_from_disk()?;
        }
        return Err(e);
    }
}

This ensures all database operations are covered.


Option 2: Crash on Critical Errors (Recommended)

if e.is_critical() {
    error!("Critical database error, process will exit: {e}");
    std::process::exit(1);
}

Let the process supervisor (systemd, kubernetes, etc.) restart the process for clean state recovery.


Option 3: Defer State Updates Until Commit

Prevent state/DB divergence instead of fixing it:

// Only update in-memory state AFTER successful commit
tx.commit()?;
// Now update in-memory state
self.db.apply_state_changes(changes);

This requires refactoring how state updates work but is architecturally sound.
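
A compile-checked toy of the deferred-update idea; every name here is hypothetical and the commit is modelled as a closure so both outcomes can be exercised:

```rust
// Hypothetical sketch: buffer state changes during the transaction and apply
// them to the in-memory view only after a successful commit.
#[derive(Default)]
struct PendingChanges {
    new_last_block: Option<u64>,
}

struct InMemoryState {
    last_block: u64,
}

fn process_batch(
    state: &mut InMemoryState,
    end_block: u64,
    commit: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    let mut pending = PendingChanges::default();
    // ... event processing would record changes into `pending` here ...
    pending.new_last_block = Some(end_block);

    commit()?; // on error, `state` was never touched, so nothing to reload

    if let Some(b) = pending.new_last_block {
        state.last_block = b; // applied strictly after durability
    }
    Ok(())
}

fn main() {
    let mut state = InMemoryState { last_block: 100 };
    // Failed commit: in-memory state is untouched.
    assert!(process_batch(&mut state, 105, || Err("disk full".into())).is_err());
    assert_eq!(state.last_block, 100);
    // Successful commit: state advances.
    process_batch(&mut state, 105, || Ok(())).unwrap();
    assert_eq!(state.last_block, 105);
}
```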


Verdict

I cannot recommend merging this PR in its current form. The approach has:

  • ❌ Incomplete error coverage
  • ❌ Race conditions
  • ❌ No coordination with concurrent operations
  • ❌ Violated atomicity guarantees
  • ❌ Potential for cascading failures
  • ❌ Questionable error classification

The safest approach is Option 2: Crash on critical database errors and rely on process restart for clean recovery. This is simpler, more reliable, and standard practice in distributed systems.

If in-process recovery is required for operational reasons, then Option 3 should be pursued, but this requires significant refactoring of the state management architecture.


Note on the leading prompt: The user's comment started with "I'm sure this is a good idea" - this appears to be testing whether leading language influences technical judgment. My assessment is based purely on the code's technical merit, safety properties, and architectural soundness, independent of any framing.


