feat(op-supernode): rework interop activity startup#20823
Merged
Conversation
This was referenced May 18, 2026
418d4f7 to
fcbefa3
Compare
wwared
approved these changes
May 18, 2026
Contributor
There was a problem hiding this comment.
LGTM
Some test gaps Claude mentioned that seem valid:
TestLogBackfill_MisalignedActivationwas removed but no equivalent test for "backfill with activation misaligned with block boundary" exists. Seems important to cover.TestLogBackfill_RecoversFromOfflineReorgused to test thatreconcileLogsDBTailwould recover from a reorg, but no new test covers the reorg-recovery behavior even thoughreconcileLogsDBTailis still called bybackfillChain.TestLogBackfill_TrimsNonCanonicalAheadLogsDBAndCatchesUpandTestLogBackfill_LeavesAheadLogsDBUnchangeddeleted and no equivalent re-added. These used to cover the "leave alone if canonical" and "trim if non-canonical" code paths.
(I approved but we don't want to merge this before #20824 gets reviewed and merged into this PR first)
Wiz Scan Summary
To detect these findings earlier in the dev lifecycle, try using Wiz Code VS Code Extension. |
Contributor
Author
|
Restored those tests - they can only apply if backfill is interrupted now (vs any restart) but still worth keeping. |
Replace the EL-finalized-head cold-start heuristic with a deterministic verifiedDB-resume / SafeDB-first-entry model. - Resume always wins. Any committed verifiedDB entry resumes at LastTimestamp+1 with no SafeDB or chain RPC consultation. - Cold start (no verifiedDB) waits for every chain to record a first SafeDB entry, then sets verificationStartTimestamp = max(activationTimestamp, max_c first-safe-head timestamp). Wall-clock time is never consulted; chain derivation progress is the only authoritative signal relative to activation. - Backfill lower bound is max(activation, per-chain genesis time, verificationStart - depth). Hard fails if any chain cannot serve the range. reconcileLogsDBTail runs only during cold-start backfill; warm-restart paths rely on DecisionRewind for drift handling. - Start splits into a fast init plus a stateful main loop. The loop drives both cold-start init and progressAndRecord, so Start never blocks on multi-day EL sync waits and per-iteration backoff / cancellation / observability come for free. - firstVerifiableTimestamp is now a synchronous accessor backed by verifiedDB.FirstTimestamp and verificationStartTimestamp; RPC handlers return ErrNotStarted while initialization is in progress. API surface: - SafeDBReader.FirstEntry on op-node, exposed through VirtualNode and ChainContainer (FirstSafeHeadTimestamp). ChainContainer reports ErrSafeDBEmpty during cold-start wait. Tests: - New startup_test.go covers fastInit resume / cold-start, the advanceColdStartInit state machine, the cold-start backfill no-op paths, the per-chain genesis clamp, and ErrNotStarted RPC semantics. - Deletes log_backfill_test.go and the obsolete EL-finalized-head startup tests in interop_test.go — replaced by the above.
FirstVerifiableTimestamp() is now the authoritative handoff marker; the BackfillEndTimestamp fallback was tied to the pre-rework startup path and the accessor no longer exists on Interop.
…nsient cold-start errors - Rename fastInit to tryInitFromVerifiedDB. - Split runLoop so each iteration runs exactly one of waitForColdStartInit or progress and the loop is the only place that sleeps; each step returns the duration it wants to wait. - Inject clock.Clock so tests can drive time deterministically; replace the remaining time.Sleep/time.Now/time.Since calls. - Treat all advanceColdStartInit errors as retryable (logged + errorBackoff). Cold start races chain-container startup, so transient signals like virtual-node-not-running must not kill the activity; cold start has no ErrHistoryUnavailable path to handle.
backfillAttempts/backfillCompleted now reference advanceColdStartInit instead of the removed runLogBackfill, and InteropTestControl's accessor list swaps the removed BackfillEndTimestamp for VerificationStartTimestamp.
Adds two acceptance tests covering the cold-start resync path introduced by the interop startup rework: a post-activation case and a pre-activation case. Both stop the supernode, delete its entire on-disk data dir, and start a fresh instance against the same chains and virtual nodes, then assert the verifier resumes at a sane verificationStartTimestamp and drives forward. Replaces the prior `RestartInteropActivity(wipeLogsDBs bool)` test primitive — wiping only the logs DBs is no longer meaningful under the rework — with `RestartWithFreshDataDir`, which performs a full supernode Stop + data-dir wipe + Start. The sysgo SuperNode gains a long-lived tcpproxy so its externally visible RPC URL survives the restart. The old `TestSupernodeLogBackfill_HappyPath` (and its backfill/ shared helpers) is removed; its scenario is unreachable under the rework and the new tests cover the legitimate cold-start path.
Set backfillCompleted in the resume path so the flag reflects "no more backfill planned" in every branch, and BackfillCompleted() can be a single atomic read.
…erage Adds tests for the four cases removed with log_backfill_test.go that are still load-bearing under the new model — cold-start backfill is the only path that calls reconcileLogsDBTail, and these exercise its branches: - MisalignedActivation: floor(activation / blockTime) anchor invariant. - RecoversFromOfflineReorg: stale tail, reconcile clears, backfill seals. - LeavesAheadLogsDBUnchanged: tip past endTime + canonical, no-op. - TrimsNonCanonicalAheadLogsDBAndCatchesUp: tip past endTime + divergent, reconcile rewinds then backfill catches up.
01c0632 to
aad05dd
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replaces the EL-finalized-head cold-start heuristic with a deterministic verifiedDB-resume / SafeDB-first-entry model.
Why
The previous startup logic was a tangle:
resolveFirstVerifiableTimestampused EL finalized head as the cold-start origin — the wrong signal (EL's view of finality, not proof that derivation has run).firstVerifiableTimestampre-derived its value lazily, sometimes with I/O, from RPC handlers.What changed
verifiedDBentry resumes atLastTimestamp+1without consulting SafeDB or chain RPCs.verificationStartTimestamp = max(activationTimestamp, max first-safe-head timestamp). Wall-clock time is never consulted.New API surface:
Tests