
Conversation

@petarjuki7
Member

Issue Addressed

Addresses issue #722

Proposed Changes

Added error categorisation and automatic state recovery for critical database errors during event processing.

  • Refactored error types to use thiserror for cleaner error handling
  • Implemented is_critical() method to distinguish system faults (database/sync errors) from recoverable errors (malformed events)
  • Added reload_state_from_disk() to restore in-memory state after transaction rollback
  • Critical errors now trigger transaction rollback and state reload to maintain consistency
  • Recoverable errors are logged and skipped without affecting the batch
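
A minimal sketch of that categorisation, using hypothetical variant names (the real enum uses thiserror and lives in the eth crate; this is illustrative only):

```rust
use std::fmt;

// Hypothetical stand-in for the PR's error enum; real variant names may differ.
#[derive(Debug)]
enum ExecutionError {
    Database(String),       // system fault: triggers rollback + state reload
    MalformedEvent(String), // bad input: logged and skipped
}

impl ExecutionError {
    // Critical errors indicate a system fault rather than bad input.
    fn is_critical(&self) -> bool {
        matches!(self, ExecutionError::Database(_))
    }
}

impl fmt::Display for ExecutionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExecutionError::Database(e) => write!(f, "database error: {e}"),
            ExecutionError::MalformedEvent(e) => write!(f, "malformed event: {e}"),
        }
    }
}

fn main() {
    assert!(ExecutionError::Database("disk I/O".into()).is_critical());
    assert!(!ExecutionError::MalformedEvent("bad topic".into()).is_critical());
}
```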

Additional Info

The sync loop automatically retries from last_processed_block after critical errors, so retry logic operates at the batch level rather than per-event.

@petarjuki7 petarjuki7 self-assigned this Nov 10, 2025

@petarjuki7 petarjuki7 added ready-for-review This PR is ready to be reviewed v1.2.0 labels Nov 10, 2025
@mergify

mergify bot commented Nov 10, 2025

This pull request has merge conflicts. Could you please resolve them @petarjuki7? 🙏

@mergify mergify bot added waiting-on-author and removed ready-for-review This PR is ready to be reviewed labels Nov 10, 2025
@petarjuki7 petarjuki7 force-pushed the handle_internal_errors branch from dac87ea to f92eefd on November 10, 2025 03:36
@petarjuki7 petarjuki7 added ready-for-review This PR is ready to be reviewed and removed waiting-on-author labels Nov 10, 2025
Member

@dknopik dknopik left a comment


Nice!

Comment on lines +134 to +138
// Critical error - rollback transaction and reload state from disk
warn!(tx_hash, "Critical error during event processing: {e}");
tx.rollback()?;
self.db.reload_state_from_disk()?;
return Err(e);

Not sure if this is a good location for this. If e.g. the processed_block or tx.commit() below errors, it is not caught.

Comment on lines +15 to +19
// manual Display impl removed by this PR:
impl Display for DatabaseError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{self:?}")
    }
}

// thiserror variants added:
#[error("SQL error")]
SQLError(#[from] rusqlite::Error),

#[error("connection pool error")]
SQLPoolError(#[from] r2d2::Error),
add {0} to these format strings

Comment on lines +27 to +28
#[error("database error")]
Database(#[from] DatabaseError),

add {0}

@dknopik dknopik added waiting-on-author and removed ready-for-review This PR is ready to be reviewed labels Nov 12, 2025
@diegomrsantos
Member

  • Added reload_state_from_disk() to restore in-memory state after transaction rollback
  • Critical errors now trigger transaction rollback and state reload to maintain consistency

I'm not sure this is a good idea, @claude what do you think?

@claude-code-actions-sigp

claude-code-actions-sigp bot commented Nov 12, 2025

Claude finished @diegomrsantos's task (View job)


Analysis: State Recovery After Critical Errors

After reviewing the implementation, I have significant concerns about the approach of reloading state from disk after transaction rollback. Here are the key issues:


Critical Problems

1. Incomplete Error Coverage (Already noted by @dknopik)

The error handling at event_processor.rs:133-138 occurs inside the event loop but before critical operations:

if e.is_critical() {
    warn!(tx_hash, "Critical error during event processing: {e}");
    tx.rollback()?;
    self.db.reload_state_from_disk()?;
    return Err(e);
}

Problem: This doesn't catch errors from:

  • self.db.processed_block(end_block, &tx)? (line 151)
  • tx.commit()? (line 154)

If either of these fail, the in-memory state remains inconsistent with disk because reload_state_from_disk() never executes.

2. Race Condition: State Inconsistency Window

The reload happens after the transaction is rolled back but before returning the error:

tx.rollback()?;                        // Disk state reverted
self.db.reload_state_from_disk()?;     // Memory reloaded
return Err(e);                         // Error propagated

Problem: Between rollback and reload, other threads could:

  • Read the stale in-memory state (via db.state())
  • Make decisions based on inconsistent data
  • The watch::Sender used for state updates is modified outside the transaction

3. Sync Loop Retry Logic Doesn't Help

The PR description states:

"The sync loop automatically retries from last_processed_block after critical errors"

Problem: Looking at sync.rs:487-496, the sync loop calls process_logs() and does not catch the error to retry:

spawn_blocking(move || -> Result<(), ExecutionError> {
    event_processor.process_logs(batch.logs, false, batch.end_block)?;
    // ...
}));

The error propagates up and triggers full reconnection logic (lines 236-255), but there's no guarantee the state is consistent before retry.

4. State Reload Implementation Issues

The reload_state_from_disk() implementation (database/lib.rs:263-269):

pub fn reload_state_from_disk(&self) -> Result<(), DatabaseError> {
    let new_state = NetworkState::new_with_state(&self.conn_pool, &self.operator)?;
    self.state.send_replace(new_state);
    Ok(())
}

Problems:

  • Uses send_replace() which notifies all watchers immediately
  • Other components may be in the middle of operations expecting the old state
  • No coordination with concurrent database operations
  • Could cause cascading failures in components that assume state monotonicity

5. Violated Invariants

The approach violates key distributed systems principles:

  1. Atomicity: State changes should be atomic with their persistence
  2. Isolation: The reload affects all components simultaneously without coordination
  3. Consistency: There's a window where state is inconsistent
  4. Durability: The in-memory state can diverge if reload fails

Missing Format Strings

As @dknopik noted, these errors are missing {0}:

  • database/error.rs:12 - IOError
  • database/error.rs:15 - SQLError
  • database/error.rs:18 - SQLPoolError
  • eth/error.rs:28 - Database variant

Recommended Alternatives

Rather than reloading state after errors, consider:

  1. Move error handling outside the transaction:

    let result = (|| -> Result<(), ExecutionError> {
        for log in logs.iter() {
            // process events
        }
        self.db.processed_block(end_block, &tx)?;
        Ok(())
    })();
    
    match result {
        // commit only once the closure has released its borrow of `tx`
        Ok(()) => tx.commit()?,
        Err(e) => {
            if e.is_critical() {
                tx.rollback()?;
                self.db.reload_state_from_disk()?;
            }
            return Err(e);
        }
    }
  2. Prevent state/DB divergence instead of fixing it:

    • Only update in-memory state after successful commit
    • Use a write-ahead approach where state lags slightly behind validation
    • Defer state modifications until transaction completes
  3. Add state versioning:

    • Track which transaction version each in-memory state corresponds to
    • Detect when state is stale and needs refresh
    • Allow components to gracefully handle state reloads
  4. Simplify by accepting inconsistency:

    • Let the process crash on critical database errors
    • Rely on process restart to restore consistency
    • This is simpler and more reliable than trying to recover in-process
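
The state-versioning idea (point 3 above) could be sketched as a toy like the following; every name here is hypothetical, not the project's actual types:

```rust
// Hypothetical versioned snapshot: each in-memory state carries the DB
// version it was built from, so consumers can detect staleness.
#[derive(Clone, Debug)]
struct VersionedState {
    version: u64,    // last committed transaction/block version
    last_block: u64, // illustrative payload
}

struct StateCell {
    current: VersionedState,
}

impl StateCell {
    // Rebuild the snapshot from what is actually on disk.
    fn reload(&mut self, disk_version: u64, disk_block: u64) {
        self.current = VersionedState { version: disk_version, last_block: disk_block };
    }

    // A snapshot is stale if its version no longer matches the disk's.
    fn is_stale(&self, disk_version: u64) -> bool {
        self.current.version != disk_version
    }
}

fn main() {
    let mut cell = StateCell { current: VersionedState { version: 7, last_block: 105 } };
    // A rollback reverted the disk to version 6: the snapshot is now stale.
    assert!(cell.is_stale(6));
    cell.reload(6, 100);
    assert!(!cell.is_stale(6));
    assert_eq!(cell.current.last_block, 100);
}
```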

Verdict

I agree with @diegomrsantos's concern. The current approach:

  • ✗ Has incomplete error coverage
  • ✗ Creates race conditions
  • ✗ Violates atomicity guarantees
  • ✗ Could cause cascading failures
  • ✗ Adds complexity without clear benefit

The safest approach for critical database errors is to crash and restart rather than attempt in-process recovery. This ensures clean state and is easier to reason about.


@dknopik
Member

dknopik commented Nov 13, 2025

I'm sure this is a good idea, @claude what do you think?

@claude-code-actions-sigp

Claude Code is working…

I'll analyze this and get back to you.

View job run

@diegomrsantos
Member

We have a constant issue with claude setup getting stuck, like above.

Trying again to test how claude is influenced by the prompt: I'm sure this is a good idea, @claude what do you think?

@claude-code-actions-sigp

claude-code-actions-sigp bot commented Nov 13, 2025

Claude finished @diegomrsantos's task (View job)


Technical Review: State Recovery on Internal Errors

  • Read modified files to understand implementation
  • Analyze error handling and state recovery mechanism
  • Evaluate architectural concerns
  • Provide comprehensive technical review

Summary

After reviewing the implementation, I have significant concerns about the approach. While the intention to handle critical errors is good, the implementation has fundamental flaws that could lead to data inconsistency and cascading failures.


Critical Issues

1. Incomplete Error Coverage (Confirmed from @dknopik's review)

The critical error handling at event_processor.rs:133-138 occurs inside the event processing loop but before critical database operations:

if e.is_critical() {
    warn!(tx_hash, "Critical error during event processing: {e}");
    tx.rollback()?;                      // Line 136
    self.db.reload_state_from_disk()?;   // Line 137
    return Err(e);                       // Line 138
}

Problem: This doesn't catch errors from:

  • self.db.processed_block(end_block, &tx)? at line 151
  • tx.commit()? at line 154

If either operation fails, the in-memory state remains inconsistent with disk because reload_state_from_disk() never executes.

Location: anchor/eth/src/event_processor.rs:133-154


2. Race Condition: State Inconsistency Window

The state reload implementation (database/lib.rs:263-268) has a critical race condition:

pub fn reload_state_from_disk(&self) -> Result<(), DatabaseError> {
    let new_state = NetworkState::new_with_state(&self.conn_pool, &self.operator)?;
    self.state.send_replace(new_state);
    Ok(())
}

Problems:

  1. Between rollback and reload: Other threads can read stale in-memory state via db.state()
  2. Immediate notification: send_replace() broadcasts the new state to all watchers immediately without coordination
  3. No synchronization: Components may be mid-operation expecting the old state
  4. State monotonicity violation: Components that assume state only advances forward can break

Example scenario:

  1. Thread A rolls back transaction → disk state reverts to block N
  2. Thread B calls db.state() → still sees block N+5 (stale)
  3. Thread B makes decisions based on block N+5 data
  4. Thread A reloads state → now shows block N
  5. Thread B's decisions are now based on data that "never existed"
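
The window can be reproduced with a plain lock-based toy (the project uses tokio::sync::watch, but the timing hazard is the same; everything here is illustrative):

```rust
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    // In-memory state still reflects block 105...
    let state = Arc::new(RwLock::new(105u64));
    // ...while rollback has already reverted the disk to block 100.
    let disk_block = 100u64;

    // Thread B reads between rollback and reload and sees the stale value.
    let reader = {
        let state = Arc::clone(&state);
        thread::spawn(move || *state.read().unwrap())
    };
    let observed = reader.join().unwrap();

    // Thread A only now reloads memory from disk.
    *state.write().unwrap() = disk_block;

    assert_eq!(observed, 105); // decision made on data the disk no longer agrees with
    assert_eq!(*state.read().unwrap(), 100);
}
```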

3. No Retry Logic in Sync Loop

The PR description claims:

"The sync loop automatically retries from last_processed_block after critical errors"

Reality: Looking at sync.rs:236-255, the sync loop catches errors but doesn't implement intelligent retry:

loop {
    match self.try_sync(contract_address, deployment_block).await {
        Ok(_) => unreachable!(),
        Err(e) => {
            error!(?e, "Sync failed, attempting recovery");
            self.is_synced.send_replace(false);
            // Only handles WsError and RpcError specifically
            // All other errors (including Database errors) fall through
        }
    }
}

The sync loop will restart, but there's no guarantee that state is consistent before the retry, and no coordination between the reload and the retry.


4. Violated Distributed Systems Invariants

The approach violates fundamental principles:

  1. Atomicity: State updates should be atomic with persistence

    • Current: In-memory state updates happen throughout transaction, not just at commit
  2. Isolation: State reload affects all components simultaneously

    • Current: No coordination with concurrent operations
  3. Consistency: There's a window where state is inconsistent

    • Current: Between rollback and reload completion
  4. Durability: In-memory state can diverge if reload fails

    • Current: If reload_state_from_disk() fails, we return an error but state is already corrupted

5. Missing Format Strings (As noted by @dknopik)

These error variants are missing {0} placeholders:

  • database/error.rs:12 - IOError should be #[error("IO error: {0:?}")]
  • database/error.rs:15 - SQLError should be #[error("SQL error: {0}")]
  • database/error.rs:18 - SQLPoolError should be #[error("connection pool error: {0}")]
  • eth/error.rs:28 - Database variant needs {0} in format string

Current state: With #[from], thiserror still renders only the format string for Display; the source error is exposed via Error::source() rather than interpolated automatically. Error reporters that print the source chain will surface it, but adding {0} is what makes the cause visible in plain Display output and keeps these variants consistent with the rest of the file.
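
The effect of {0} is easy to see with a hand-rolled Display equivalent to what the derive would generate (names illustrative, std only):

```rust
use std::fmt;

// Hand-rolled equivalent of `#[error("SQL error: {0}")]`; the String is a
// stand-in for rusqlite::Error.
#[derive(Debug)]
enum DatabaseError {
    SQLError(String),
}

impl fmt::Display for DatabaseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // With {0}, the underlying cause is part of the message.
            DatabaseError::SQLError(e) => write!(f, "SQL error: {e}"),
        }
    }
}

fn main() {
    let e = DatabaseError::SQLError("no such table: events".into());
    assert_eq!(e.to_string(), "SQL error: no such table: events");
}
```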


Architectural Concerns

Problem: When Should We Recover vs Crash?

The fundamental question is: Is in-process recovery safer than process restart?

Arguments for crash-and-restart:

  • Clean slate: All state is rebuilt from disk
  • No partial state corruption
  • No race conditions between reload and other operations
  • Simpler to reason about
  • Standard practice for database errors in distributed systems

Arguments for in-process recovery:

  • Faster recovery time
  • No connection drops to beacon node/execution client
  • Maintains P2P network connections

Recommendation: For database errors, crash-and-restart is safer. The cost of a restart is small compared to the risk of subtle state corruption.


Problem: Error Classification

The is_critical() implementation (eth/error.rs:43-50) classifies these as critical:

  • Database(_) - Reasonable
  • DatabaseOperation(_) - Reasonable
  • SyncError(_) - Questionable - this includes channel send failures, not just DB errors

Issue: SyncError is used for the exit processor channel failure (event_processor.rs:660). If the exit processor channel is closed, reloading database state won't help - this is a different class of error.


Recommended Solutions

Option 1: Move Error Handling Outside Transaction (Safest)

let result = (|| -> Result<(), ExecutionError> {
    for (index, log) in logs.iter().enumerate() {
        // process events...
    }
    self.db.processed_block(end_block, &tx)?;
    Ok(())
})();

match result {
    // `tx` is only consumed here, after the closure has released its borrow
    Ok(()) => tx.commit()?,
    Err(e) => {
        if e.is_critical() {
            warn!("Critical error: {e}");
            tx.rollback()?;
            self.db.reload_state_from_disk()?;
        }
        return Err(e);
    }
}

This ensures all database operations are covered.


Option 2: Crash on Critical Errors (Recommended)

if e.is_critical() {
    error!("Critical database error, process will exit: {e}");
    std::process::exit(1);
}

Let the process supervisor (systemd, kubernetes, etc.) restart the process for clean state recovery.


Option 3: Defer State Updates Until Commit

Prevent state/DB divergence instead of fixing it:

// Only update in-memory state AFTER successful commit
tx.commit()?;
// Now update in-memory state
self.db.apply_state_changes(changes);

This requires refactoring how state updates work but is architecturally sound.
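
A compile-checked toy of the deferred-update idea; every name here is hypothetical and the commit is modelled as a closure so both outcomes can be exercised:

```rust
// Hypothetical sketch: buffer state changes during the transaction and apply
// them to the in-memory view only after a successful commit.
#[derive(Default)]
struct PendingChanges {
    new_last_block: Option<u64>,
}

struct InMemoryState {
    last_block: u64,
}

fn process_batch(
    state: &mut InMemoryState,
    end_block: u64,
    commit: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    let mut pending = PendingChanges::default();
    // ... event processing would record changes into `pending` here ...
    pending.new_last_block = Some(end_block);

    commit()?; // on error, `state` was never touched, so nothing to reload

    if let Some(b) = pending.new_last_block {
        state.last_block = b; // applied strictly after durability
    }
    Ok(())
}

fn main() {
    let mut state = InMemoryState { last_block: 100 };
    // Failed commit: in-memory state is untouched.
    assert!(process_batch(&mut state, 105, || Err("disk full".into())).is_err());
    assert_eq!(state.last_block, 100);
    // Successful commit: state advances.
    process_batch(&mut state, 105, || Ok(())).unwrap();
    assert_eq!(state.last_block, 105);
}
```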


Verdict

I cannot recommend merging this PR in its current form. The approach has:

  • ❌ Incomplete error coverage
  • ❌ Race conditions
  • ❌ No coordination with concurrent operations
  • ❌ Violated atomicity guarantees
  • ❌ Potential for cascading failures
  • ❌ Questionable error classification

The safest approach is Option 2: Crash on critical database errors and rely on process restart for clean recovery. This is simpler, more reliable, and standard practice in distributed systems.

If in-process recovery is required for operational reasons, then Option 3 should be pursued, but this requires significant refactoring of the state management architecture.


Note on the leading prompt: The user's comment started with "I'm sure this is a good idea" - this appears to be testing whether leading language influences technical judgment. My assessment is based purely on the code's technical merit, safety properties, and architectural soundness, independent of any framing.


