
feat(op-supernode): rework interop activity startup#20823

Merged
ajsutton merged 7 commits into develop from aj/feat/interop-startup-rework on May 19, 2026
Conversation

@ajsutton
Contributor

@ajsutton ajsutton commented May 17, 2026

Replaces the EL-finalized-head cold-start heuristic with a deterministic verifiedDB-resume / SafeDB-first-entry model.

Why

The previous startup logic was a tangle:

  • `resolveFirstVerifiableTimestamp` used the EL finalized head as the cold-start origin, which is the wrong signal: it reflects the EL's view of finality, not proof that derivation has run.
  • A retry loop conflated transient EL not-ready errors with permanent SafeDB gaps.
  • Backfill ran before the verification start timestamp had been decided.
  • `firstVerifiableTimestamp` re-derived its value lazily, sometimes with I/O, from RPC handlers.

What changed

  • Resume always wins. Any committed `verifiedDB` entry resumes at `LastTimestamp+1` without consulting SafeDB or chain RPCs.
  • Cold start waits for every chain to record a first SafeDB entry, then sets `verificationStartTimestamp` to the larger of `activationTimestamp` and the maximum first-safe-head timestamp across chains. Wall-clock time is never consulted.
  • Backfill clamps its lower bound to `max(activation, per-chain genesis, verificationStart - depth)`. It hard-fails if any chain can't serve the range. `reconcileLogsDBTail` runs only during cold-start backfill.
  • `Start` splits into a fast init plus a stateful main loop. The loop drives both cold-start init and `progressAndRecord`, so `Start` never blocks on multi-day EL sync waits. Per-iteration backoff, cancellation, and observability come for free.
  • `firstVerifiableTimestamp` is now a synchronous accessor; RPC handlers return `ErrNotStarted` until init completes.
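
The two timestamp rules above can be sketched in a few lines of Go. This is an illustrative sketch only, not the code in this PR: `verifiedState`, `verificationStart`, and `backfillLowerBound` are hypothetical names, and the saturating subtraction for `verificationStart - depth` is an assumption made here to avoid unsigned underflow.

```go
package main

import "fmt"

// verifiedState is a hypothetical summary of what verifiedDB holds at startup.
type verifiedState struct {
	HasEntries    bool
	LastTimestamp uint64
}

// verificationStart picks the timestamp at which verification begins.
// Resume always wins; otherwise the cold-start rule applies:
// max(activation, max over chains of the first safe-head timestamp).
func verificationStart(v verifiedState, activation uint64, firstSafeHeads []uint64) uint64 {
	if v.HasEntries {
		return v.LastTimestamp + 1
	}
	start := activation
	for _, ts := range firstSafeHeads {
		if ts > start {
			start = ts
		}
	}
	return start
}

// backfillLowerBound computes max(activation, genesis, start - depth) for one chain.
// The subtraction saturates at zero (an assumption of this sketch).
func backfillLowerBound(activation, genesis, start, depth uint64) uint64 {
	var lower uint64
	if start > depth {
		lower = start - depth
	}
	if activation > lower {
		lower = activation
	}
	if genesis > lower {
		lower = genesis
	}
	return lower
}

func main() {
	// Resume: a committed entry at 1000 resumes at 1001, ignoring SafeDB.
	fmt.Println(verificationStart(verifiedState{HasEntries: true, LastTimestamp: 1000}, 500, nil))
	// Cold start: the slowest chain's first safe head (700) is past activation (500).
	fmt.Println(verificationStart(verifiedState{}, 500, []uint64{700, 650}))
	// Per-chain genesis clamp: genesis (650) is later than start - depth (400).
	fmt.Println(backfillLowerBound(100, 650, 700, 300))
}
```

Keeping both rules as pure functions of already-known values reflects the PR's point that wall-clock time plays no part in the decision.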

New API surface:

  • `SafeDBReader.FirstEntry` (op-node) → `VirtualNode.FirstSafeHeadEntry` → `ChainContainer.FirstSafeHeadTimestamp`. Returns `ErrSafeDBEmpty` during cold-start wait.
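
As a rough illustration of how a cold-start wait might consume this API, the sketch below polls every chain once and reports "not ready" while any chain still returns `ErrSafeDBEmpty`. The interface shape, the error message, and the helper `maxFirstSafeHead` are assumptions for illustration; the real `ChainContainer.FirstSafeHeadTimestamp` signature is not shown in this PR description and may differ (for example, it may take a context).

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSafeDBEmpty mirrors the sentinel named in the PR; the message is made up.
var ErrSafeDBEmpty = errors.New("safedb is empty")

// firstSafeHeadReader is a hypothetical narrow view of ChainContainer.
type firstSafeHeadReader interface {
	FirstSafeHeadTimestamp() (uint64, error)
}

// fakeChain is a test double returning a fixed timestamp or error.
type fakeChain struct {
	ts  uint64
	err error
}

func (f fakeChain) FirstSafeHeadTimestamp() (uint64, error) { return f.ts, f.err }

// maxFirstSafeHead returns the largest first safe-head timestamp across chains.
// ready is false (with a nil error) while any chain has no SafeDB entry yet;
// any other error is passed through to the caller.
func maxFirstSafeHead(chains []firstSafeHeadReader) (ts uint64, ready bool, err error) {
	for _, c := range chains {
		first, err := c.FirstSafeHeadTimestamp()
		if errors.Is(err, ErrSafeDBEmpty) {
			return 0, false, nil
		}
		if err != nil {
			return 0, false, err
		}
		if first > ts {
			ts = first
		}
	}
	return ts, true, nil
}

func main() {
	waiting := []firstSafeHeadReader{fakeChain{ts: 10}, fakeChain{err: ErrSafeDBEmpty}}
	fmt.Println(maxFirstSafeHead(waiting))

	ready := []firstSafeHeadReader{fakeChain{ts: 10}, fakeChain{ts: 30}}
	fmt.Println(maxFirstSafeHead(ready))
}
```

Treating `ErrSafeDBEmpty` as "keep waiting" rather than as a failure is what lets the caller retry on the next loop iteration.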

Tests

  • `startup_test.go` covers `fastInit` resume / cold-start, the `advanceColdStartInit` state machine, cold-start backfill no-op paths, per-chain genesis clamp, RPC `ErrNotStarted` semantics, and the four `reconcileLogsDBTail` edge cases (misaligned activation, offline-reorg recovery, ahead-tip canonical no-op, ahead-tip divergent rewind + catch-up).
  • Deleted `log_backfill_test.go` and the obsolete EL-finalized-head startup tests in `interop_test.go` — replaced by the above.
  • End-to-end resync acceptance tests (post- and pre-activation, with/without EL data wipe) are tracked as follow-up work, not included here.

@ajsutton ajsutton requested a review from a team as a code owner May 17, 2026 22:46
@ajsutton ajsutton force-pushed the aj/feat/interop-startup-rework branch from 418d4f7 to fcbefa3 Compare May 18, 2026 20:12
@ajsutton ajsutton changed the base branch from develop to aj/refactor/safedb-l1-at-safe-head May 18, 2026 20:12
Contributor

@wwared wwared left a comment


LGTM

Some test gaps Claude mentioned that seem valid:

  • TestLogBackfill_MisalignedActivation was removed but no equivalent test for "backfill with activation misaligned with block boundary" exists. Seems important to cover.
  • TestLogBackfill_RecoversFromOfflineReorg used to test that reconcileLogsDBTail would recover from a reorg, but no new test covers the reorg-recovery behavior even though reconcileLogsDBTail is still called by backfillChain.
  • TestLogBackfill_TrimsNonCanonicalAheadLogsDBAndCatchesUp and TestLogBackfill_LeavesAheadLogsDBUnchanged deleted and no equivalent re-added. These used to cover the "leave alone if canonical" and "trim if non-canonical" code paths.

(I approved, but we don't want to merge this until #20824 has been reviewed and merged into this PR.)

Comment thread op-supernode/supernode/activity/interop/interop_test_access.go Outdated
@wiz-0f98cca50a

wiz-0f98cca50a Bot commented May 18, 2026

Wiz Scan Summary

| Scanner | Findings |
|---|---|
| Vulnerabilities | - |
| Sensitive Data | - |
| Secrets | - |
| IaC Misconfigurations | - |
| SAST Findings | 1 Medium |
| Software Management Findings | - |
| Total | 1 Medium |


Comment thread op-devstack/sysgo/l2_cl_supernode.go
@ajsutton
Contributor Author

Restored those tests. They now apply only if backfill is interrupted (rather than on any restart), but they are still worth keeping.

Base automatically changed from aj/refactor/safedb-l1-at-safe-head to develop May 18, 2026 22:40
@ajsutton ajsutton enabled auto-merge May 18, 2026 22:48
ajsutton added 7 commits May 19, 2026 08:50
Replace the EL-finalized-head cold-start heuristic with a deterministic
verifiedDB-resume / SafeDB-first-entry model.

- Resume always wins. Any committed verifiedDB entry resumes at
  LastTimestamp+1 with no SafeDB or chain RPC consultation.
- Cold start (no verifiedDB) waits for every chain to record a first
  SafeDB entry, then sets verificationStartTimestamp =
  max(activationTimestamp, max_c first-safe-head timestamp). Wall-clock
  time is never consulted; chain derivation progress is the only
  authoritative signal relative to activation.
- Backfill lower bound is max(activation, per-chain genesis time,
  verificationStart - depth). Hard fails if any chain cannot serve the
  range. reconcileLogsDBTail runs only during cold-start backfill;
  warm-restart paths rely on DecisionRewind for drift handling.
- Start splits into a fast init plus a stateful main loop. The loop
  drives both cold-start init and progressAndRecord, so Start never
  blocks on multi-day EL sync waits and per-iteration backoff /
  cancellation / observability come for free.
- firstVerifiableTimestamp is now a synchronous accessor backed by
  verifiedDB.FirstTimestamp and verificationStartTimestamp; RPC
  handlers return ErrNotStarted while initialization is in progress.
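
A minimal sketch of the synchronous-accessor idea, assuming a single cached value guarded by a mutex. The real accessor is described as backed by verifiedDB.FirstTimestamp and verificationStartTimestamp; the type `startupState`, the method `markInitialized`, and the error message here are hypothetical simplifications.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNotStarted mirrors the sentinel named above; the message is made up.
var ErrNotStarted = errors.New("interop activity not started")

// startupState is a hypothetical holder for the value fixed during init.
type startupState struct {
	mu              sync.RWMutex
	initialized     bool
	firstVerifiable uint64
}

// markInitialized is called once by the main loop when init completes.
func (s *startupState) markInitialized(ts uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.firstVerifiable = ts
	s.initialized = true
}

// FirstVerifiableTimestamp does no I/O: it returns the cached value,
// or ErrNotStarted while initialization is still in progress.
func (s *startupState) FirstVerifiableTimestamp() (uint64, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if !s.initialized {
		return 0, ErrNotStarted
	}
	return s.firstVerifiable, nil
}

func main() {
	s := &startupState{}
	_, err := s.FirstVerifiableTimestamp()
	fmt.Println(err)

	s.markInitialized(1700000000)
	ts, err := s.FirstVerifiableTimestamp()
	fmt.Println(ts, err)
}
```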

API surface:
- SafeDBReader.FirstEntry on op-node, exposed through VirtualNode and
  ChainContainer (FirstSafeHeadTimestamp). ChainContainer reports
  ErrSafeDBEmpty during cold-start wait.

Tests:
- New startup_test.go covers fastInit resume / cold-start, the
  advanceColdStartInit state machine, the cold-start backfill no-op
  paths, the per-chain genesis clamp, and ErrNotStarted RPC semantics.
- Deletes log_backfill_test.go and the obsolete EL-finalized-head
  startup tests in interop_test.go, replaced by the above.

FirstVerifiableTimestamp() is now the authoritative handoff marker;
the BackfillEndTimestamp fallback was tied to the pre-rework startup
path and the accessor no longer exists on Interop.

…nsient cold-start errors

- Rename fastInit to tryInitFromVerifiedDB.
- Split runLoop so each iteration runs exactly one of waitForColdStartInit
  or progress and the loop is the only place that sleeps; each step returns
  the duration it wants to wait.
- Inject clock.Clock so tests can drive time deterministically; replace
  the remaining time.Sleep/time.Now/time.Since calls.
- Treat all advanceColdStartInit errors as retryable (logged + errorBackoff).
  Cold start races chain-container startup, so transient signals like
  virtual-node-not-running must not kill the activity; cold start has no
  ErrHistoryUnavailable path to handle.
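
The loop shape described in this commit can be sketched as follows. Everything here is a hypothetical stand-in: `sleeper` is a minimal substitute for the injected clock, `activity` and its fields are invented for the example, and the `errorBackoff` value is arbitrary. Each iteration runs exactly one step, the step reports how long it wants to wait, and only the loop sleeps.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sleeper is a hypothetical minimal stand-in for an injected clock.
type sleeper interface {
	SleepCtx(ctx context.Context, d time.Duration) error
}

const errorBackoff = 2 * time.Second // arbitrary value for this sketch

type activity struct {
	clk         sleeper
	initialized bool
	coldStart   func(ctx context.Context) (done bool, wait time.Duration, err error)
	progress    func(ctx context.Context) time.Duration
}

// iterate runs exactly one step and returns how long the loop should wait.
func (a *activity) iterate(ctx context.Context) time.Duration {
	if !a.initialized {
		done, wait, err := a.coldStart(ctx)
		if err != nil {
			return errorBackoff // cold-start errors are retried, never fatal
		}
		a.initialized = done
		return wait
	}
	return a.progress(ctx)
}

// runLoop is the only place that sleeps.
func (a *activity) runLoop(ctx context.Context) {
	for ctx.Err() == nil {
		if err := a.clk.SleepCtx(ctx, a.iterate(ctx)); err != nil {
			return
		}
	}
}

// fakeClock records requested sleeps and cancels the context after limit calls.
type fakeClock struct {
	slept  []time.Duration
	limit  int
	cancel context.CancelFunc
}

func (f *fakeClock) SleepCtx(ctx context.Context, d time.Duration) error {
	f.slept = append(f.slept, d)
	if len(f.slept) >= f.limit {
		f.cancel()
	}
	return ctx.Err()
}

// demo drives three iterations: a failed cold-start attempt, a successful one, then progress.
func demo() []time.Duration {
	ctx, cancel := context.WithCancel(context.Background())
	clk := &fakeClock{limit: 3, cancel: cancel}
	attempts := 0
	a := &activity{
		clk: clk,
		coldStart: func(context.Context) (bool, time.Duration, error) {
			attempts++
			if attempts == 1 {
				return false, 0, errors.New("virtual node not running")
			}
			return true, 0, nil
		},
		progress: func(context.Context) time.Duration { return time.Second },
	}
	a.runLoop(ctx)
	return clk.slept
}

func main() {
	fmt.Println(demo()) // prints [2s 0s 1s]
}
```

Because the fake clock never actually blocks, a test can drive the state machine through a transient failure and into steady-state progress without real delays.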
backfillAttempts/backfillCompleted now reference advanceColdStartInit
instead of the removed runLogBackfill, and InteropTestControl's accessor
list swaps the removed BackfillEndTimestamp for VerificationStartTimestamp.

Adds two acceptance tests covering the cold-start resync path introduced
by the interop startup rework: a post-activation case and a
pre-activation case. Both stop the supernode, delete its entire on-disk
data dir, and start a fresh instance against the same chains and
virtual nodes, then assert the verifier resumes at a sane
verificationStartTimestamp and drives forward.

Replaces the prior `RestartInteropActivity(wipeLogsDBs bool)` test
primitive — wiping only the logs DBs is no longer meaningful under the
rework — with `RestartWithFreshDataDir`, which performs a full
supernode Stop + data-dir wipe + Start. The sysgo SuperNode gains a
long-lived tcpproxy so its externally visible RPC URL survives the
restart. The old `TestSupernodeLogBackfill_HappyPath` (and its
backfill/ shared helpers) is removed; its scenario is unreachable under
the rework and the new tests cover the legitimate cold-start path.

Set backfillCompleted in the resume path so the flag reflects "no more
backfill planned" in every branch, and BackfillCompleted() can be a
single atomic read.
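
One way to picture this, as a sketch rather than the actual Interop fields: a flag stored in an `atomic.Bool` (Go 1.19+) that every startup branch sets, so the getter needs no lock and no branch-specific logic. The type `interopFlags` and the two setter methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// interopFlags is a hypothetical holder for the backfill flag.
type interopFlags struct {
	backfillCompleted atomic.Bool
}

// onResume models the resume path: no backfill is planned, so mark it done.
func (f *interopFlags) onResume() { f.backfillCompleted.Store(true) }

// onColdStartBackfillDone models the cold-start path finishing its backfill.
func (f *interopFlags) onColdStartBackfillDone() { f.backfillCompleted.Store(true) }

// BackfillCompleted is a single atomic read, valid in every branch.
func (f *interopFlags) BackfillCompleted() bool { return f.backfillCompleted.Load() }

func main() {
	f := &interopFlags{}
	fmt.Println(f.BackfillCompleted())
	f.onResume()
	fmt.Println(f.BackfillCompleted())
}
```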
…erage

Adds tests for the four cases removed with log_backfill_test.go that are
still load-bearing under the new model — cold-start backfill is the only
path that calls reconcileLogsDBTail, and these exercise its branches:

- MisalignedActivation: floor(activation / blockTime) anchor invariant.
- RecoversFromOfflineReorg: stale tail, reconcile clears, backfill seals.
- LeavesAheadLogsDBUnchanged: tip past endTime + canonical, no-op.
- TrimsNonCanonicalAheadLogsDBAndCatchesUp: tip past endTime + divergent,
  reconcile rewinds then backfill catches up.
@ajsutton ajsutton force-pushed the aj/feat/interop-startup-rework branch from 01c0632 to aad05dd Compare May 18, 2026 23:03
@ajsutton ajsutton added this pull request to the merge queue May 18, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 19, 2026
@ajsutton ajsutton added this pull request to the merge queue May 19, 2026
Merged via the queue into develop with commit 898ce8b May 19, 2026
85 checks passed
@ajsutton ajsutton deleted the aj/feat/interop-startup-rework branch May 19, 2026 01:48

2 participants