
fix: fail fast on cold-start backfill errors #20840

Draft
karlfloersch wants to merge 5 commits into develop from karl/fail-fast-cold-start-backfill


@karlfloersch
Contributor

Summary

This is a narrow follow-up to PR #20823. It keeps the cold-start retry behavior while the supernode is waiting for VNs/SafeDB to become ready, but treats failures from the actual configured log-backfill range as fatal.

Rationale

Once every chain has produced a SafeDB first entry, the supernode has selected the startup handoff and has begun fetching historical logs. If that historical range cannot be served, retrying the same range forever does not make progress and contradicts the intended hard-fail behavior for unavailable backfill history.

Changes

  • Wrap runColdStartBackfill failures in a sentinel errColdStartBackfill.
  • Let waitForColdStartInit return that sentinel instead of retrying indefinitely.
  • Add a unit test proving Start exits when the backfill range cannot be fetched after SafeDB readiness.

Validation

  • go test ./op-supernode/supernode/activity/interop
  • git diff --check

ajsutton and others added 5 commits May 18, 2026 08:45
Replace the EL-finalized-head cold-start heuristic with a deterministic
verifiedDB-resume / SafeDB-first-entry model.

- Resume always wins. Any committed verifiedDB entry resumes at
  LastTimestamp+1 with no SafeDB or chain RPC consultation.
- Cold start (no verifiedDB) waits for every chain to record a first
  SafeDB entry, then sets verificationStartTimestamp =
  max(activationTimestamp, max_c first-safe-head timestamp). Wall-clock
  time is never consulted; chain derivation progress is the only
  authoritative signal relative to activation.
- Backfill lower bound is max(activation, per-chain genesis time,
  verificationStart - depth). Hard fails if any chain cannot serve the
  range. reconcileLogsDBTail runs only during cold-start backfill;
  warm-restart paths rely on DecisionRewind for drift handling.
- Start splits into a fast init plus a stateful main loop. The loop
  drives both cold-start init and progressAndRecord, so Start never
  blocks on multi-day EL sync waits and per-iteration backoff /
  cancellation / observability come for free.
- firstVerifiableTimestamp is now a synchronous accessor backed by
  verifiedDB.FirstTimestamp and verificationStartTimestamp; RPC
  handlers return ErrNotStarted while initialization is in progress.
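The two timestamp formulas above can be sketched as follows. This is an illustration under stated assumptions, not the code in the commit: the function names, the map keyed by chain ID, and the saturating subtraction that guards against uint64 underflow are all hypothetical.

```go
package main

import "fmt"

// verificationStart returns max(activation, max over chains of the first
// safe-head timestamp). The map is keyed by a hypothetical chain ID.
func verificationStart(activation uint64, firstSafeHead map[uint64]uint64) uint64 {
	start := activation
	for _, ts := range firstSafeHead {
		if ts > start {
			start = ts
		}
	}
	return start
}

// backfillLowerBound returns max(activation, genesis, start-depth) for one
// chain. The subtraction saturates at zero (an assumption) so that a depth
// larger than start cannot wrap around.
func backfillLowerBound(activation, genesis, start, depth uint64) uint64 {
	var lower uint64
	if start > depth {
		lower = start - depth
	}
	if activation > lower {
		lower = activation
	}
	if genesis > lower {
		lower = genesis
	}
	return lower
}

func main() {
	const activation, depth = 1000, 600
	start := verificationStart(activation, map[uint64]uint64{10: 1200, 8453: 1500})
	// A chain whose genesis is after start-depth is clamped to its genesis time.
	lateGenesis := backfillLowerBound(activation, 1100, start, depth)
	// A chain with an early genesis is clamped to the activation timestamp.
	earlyGenesis := backfillLowerBound(activation, 0, start, depth)
	fmt.Println(start, lateGenesis, earlyGenesis) // prints "1500 1100 1000"
}
```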

API surface:
- SafeDBReader.FirstEntry on op-node, exposed through VirtualNode and
  ChainContainer (FirstSafeHeadTimestamp). ChainContainer reports
  ErrSafeDBEmpty during cold-start wait.

Tests:
- New startup_test.go covers fastInit resume / cold-start, the
  advanceColdStartInit state machine, the cold-start backfill no-op
  paths, the per-chain genesis clamp, and ErrNotStarted RPC semantics.
- Deletes log_backfill_test.go and the obsolete EL-finalized-head
  startup tests in interop_test.go — replaced by the above.
FirstVerifiableTimestamp() is now the authoritative handoff marker;
the BackfillEndTimestamp fallback was tied to the pre-rework startup
path and the accessor no longer exists on Interop.
…nsient cold-start errors

- Rename fastInit to tryInitFromVerifiedDB.
- Split runLoop so each iteration runs exactly one of waitForColdStartInit
  or progress and the loop is the only place that sleeps; each step returns
  the duration it wants to wait.
- Inject clock.Clock so tests can drive time deterministically; replace
  the remaining time.Sleep/time.Now/time.Since calls.
- Treat all advanceColdStartInit errors as retryable (logged + errorBackoff).
  Cold start races chain-container startup, so transient signals like
  virtual-node-not-running must not kill the activity; cold start has no
  ErrHistoryUnavailable path to handle.
backfillAttempts/backfillCompleted now reference advanceColdStartInit
instead of the removed runLogBackfill, and InteropTestControl's accessor
list swaps the removed BackfillEndTimestamp for VerificationStartTimestamp.
@ajsutton ajsutton force-pushed the aj/feat/interop-startup-rework branch 2 times, most recently from 01c0632 to aad05dd Compare May 18, 2026 23:03
Base automatically changed from aj/feat/interop-startup-rework to develop May 19, 2026 01:48


2 participants