-
Notifications
You must be signed in to change notification settings - Fork 27
feat: restore in-memory state from disk on internal errors #735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
|
This pull request has merge conflicts. Could you please resolve them @petarjuki7? 🙏 |
dac87ea to
f92eefd
Compare
dknopik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
| // Critical error - rollback transaction and reload state from disk | ||
| warn!(tx_hash, "Critical error during event processing: {e}"); | ||
| tx.rollback()?; | ||
| self.db.reload_state_from_disk()?; | ||
| return Err(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is a good location for this. If e.g. the processed_block or tx.commit() below errors, it is not caught.
| #[error("SQL error")] | ||
| SQLError(#[from] rusqlite::Error), | ||
|
|
||
| impl Display for DatabaseError { | ||
| fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { | ||
| write!(f, "{self:?}") | ||
| } | ||
| #[error("connection pool error")] | ||
| SQLPoolError(#[from] r2d2::Error), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add {0} to these format strings
| #[error("database error")] | ||
| Database(#[from] DatabaseError), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add {0}
I'm not sure this is a good idea, @claude what do you think? |
|
Claude finished @diegomrsantos's task —— View job Analysis: State Recovery After Critical ErrorsAfter reviewing the implementation, I have significant concerns about the approach of reloading state from disk after transaction rollback. Here are the key issues: Critical Problems1. Incomplete Error Coverage (Already noted by @dknopik)The error handling at if e.is_critical() {
warn!(tx_hash, "Critical error during event processing: {e}");
tx.rollback()?;
self.db.reload_state_from_disk()?;
return Err(e);
}Problem: This doesn't catch errors from:
If either of these fail, the in-memory state remains inconsistent with disk because 2. Race Condition: State Inconsistency WindowThe reload happens after the transaction is rolled back but before returning the error: tx.rollback()?; // Disk state reverted
self.db.reload_state_from_disk()?; // Memory reloaded
return Err(e); // Error propagatedProblem: Between rollback and reload, other threads could:
3. Sync Loop Retry Logic Doesn't HelpThe PR description states:
Problem: Looking at spawn_blocking(move || -> Result<(), ExecutionError> {
event_processor.process_logs(batch.logs, false, batch.end_block)?;
// ...
}));The error propagates up and triggers full reconnection logic (lines 236-255), but there's no guarantee the state is consistent before retry. 4. State Reload Implementation IssuesThe pub fn reload_state_from_disk(&self) -> Result<(), DatabaseError> {
let new_state = NetworkState::new_with_state(&self.conn_pool, &self.operator)?;
self.state.send_replace(new_state);
Ok(())
}Problems:
5. Violated InvariantsThe approach violates key distributed systems principles:
Missing Format StringsAs @dknopik noted, these errors are missing
Recommended AlternativesRather than reloading state after errors, consider:
VerdictI agree with @diegomrsantos's concern. The current approach:
The safest approach for critical database errors is to crash and restart rather than attempt in-process recovery. This ensures clean state and is easier to reason about. |
|
I'm sure this is a good idea, @claude what do you think? |
|
I'll analyze this and get back to you. |
|
We have a constant issue with claude setup getting stuck, like above. Trying again to test how claude is influenced by the prompt: I'm sure this is a good idea, @claude what do you think? |
|
Claude finished @diegomrsantos's task —— View job Technical Review: State Recovery on Internal Errors
SummaryAfter reviewing the implementation, I have significant concerns about the approach. While the intention to handle critical errors is good, the implementation has fundamental flaws that could lead to data inconsistency and cascading failures. Critical Issues1. Incomplete Error Coverage (Confirmed from @dknopik's review)The critical error handling at if e.is_critical() {
warn!(tx_hash, "Critical error during event processing: {e}");
tx.rollback()?; // Line 136
self.db.reload_state_from_disk()?; // Line 137
return Err(e); // Line 138
}Problem: This doesn't catch errors from:
If either operation fails, the in-memory state remains inconsistent with disk because Location: 2. Race Condition: State Inconsistency WindowThe state reload implementation ( pub fn reload_state_from_disk(&self) -> Result<(), DatabaseError> {
let new_state = NetworkState::new_with_state(&self.conn_pool, &self.operator)?;
self.state.send_replace(new_state);
Ok(())
}Problems:
Example scenario:
3. No Retry Logic in Sync LoopThe PR description claims:
Reality: Looking at loop {
match self.try_sync(contract_address, deployment_block).await {
Ok(_) => unreachable!(),
Err(e) => {
error!(?e, "Sync failed, attempting recovery");
self.is_synced.send_replace(false);
// Only handles WsError and RpcError specifically
// All other errors (including Database errors) fall through
}
}
}The sync loop will restart, but there's no guarantee that state is consistent before the retry, and no coordination between the reload and the retry. 4. Violated Distributed Systems InvariantsThe approach violates fundamental principles:
5. Missing Format Strings (As noted by @dknopik)These error variants are missing
Current state: The Architectural ConcernsProblem: When Should We Recover vs Crash?The fundamental question is: Is in-process recovery safer than process restart? Arguments for crash-and-restart:
Arguments for in-process recovery:
Recommendation: For database errors, crash-and-restart is safer. The cost of a restart is small compared to the risk of subtle state corruption. Problem: Error ClassificationThe
Issue: Recommended SolutionsOption 1: Move Error Handling Outside Transaction (Safest)let result = (|| -> Result<(), ExecutionError> {
for (index, log) in logs.iter().enumerate() {
// process events...
}
self.db.processed_block(end_block, &tx)?;
tx.commit()?;
Ok(())
})();
if let Err(e) = result {
if e.is_critical() {
warn!("Critical error: {e}");
tx.rollback()?;
self.db.reload_state_from_disk()?;
}
return Err(e);
}This ensures all database operations are covered. Option 2: Crash on Critical Errors (Recommended)if e.is_critical() {
error!("Critical database error, process will exit: {e}");
std::process::exit(1);
}Let the process supervisor (systemd, kubernetes, etc.) restart the process for clean state recovery. Option 3: Defer State Updates Until CommitPrevent state/DB divergence instead of fixing it: // Only update in-memory state AFTER successful commit
tx.commit()?;
// Now update in-memory state
self.db.apply_state_changes(changes);This requires refactoring how state updates work but is architecturally sound. VerdictI cannot recommend merging this PR in its current form. The approach has:
The safest approach is Option 2: Crash on critical database errors and rely on process restart for clean recovery. This is simpler, more reliable, and standard practice in distributed systems. If in-process recovery is required for operational reasons, then Option 3 should be pursued, but this requires significant refactoring of the state management architecture. Note on the leading prompt: The user's comment started with "I'm sure this is a good idea" - this appears to be testing whether leading language influences technical judgment. My assessment is based purely on the code's technical merit, safety properties, and architectural soundness, independent of any framing. |

Issue Addressed
Addresses issue #722
Proposed Changes
Added error categorisation and automatic state recovery for critical database errors during event processing.
thiserrorfor cleaner error handlingis_critical()method to distinguish system faults (database/sync errors) from recoverable errors (malformed events)reload_state_from_disk()to restore in-memory state after transaction rollbackAdditional Info
The sync loop automatically retries from
last_processed_blockafter critical errors, so retry logic operates at the batch level rather than per-event.