Skip to content

test(op-acceptance-tests): interop startup-rework resync tests with EL wipe#20827

Closed
ajsutton wants to merge 25 commits into
developfrom
aj/test-interop-startup-rework-cl-el-reset
Closed

test(op-acceptance-tests): interop startup-rework resync tests with EL wipe#20827
ajsutton wants to merge 25 commits into
developfrom
aj/test-interop-startup-rework-cl-el-reset

Conversation

@ajsutton
Copy link
Copy Markdown
Contributor

@ajsutton ajsutton commented May 18, 2026

Adds tests 2 and 4 from the interop startup-rework acceptance plan, folded into the post- and pre-activation tests as EL data wiped subtests via t.Run. The fix the EL-wiped path was originally written to exercise has been split out to #20855; these tests cover the resync mechanics on their own.

TwoL2SupernodeInteropPeerEL extends TwoL2SupernodeInterop the same way TwoL2SupernodeFollowL2 does. The supernode VN runs in Sync.SyncMode = ELSync, reqresp + discovery disabled; block content reaches the supernode-fronted EL over execution-layer devp2p from a sibling sequencer EL, CL pubsub only carries head announcements.

Peer tracking moved from DSL into sysgo: OpReth, OpGeth, and SuperNode embed a peerRegistry, connectL2ELPeersBidi/connectL2CLPeers register a re-dial closure on both participants, and each node replays its registry as the final step of Start/startLocked. Restart-after-wipe now uses the same dial code as initial setup (bidi + trusted for ELs), instead of the previous one-way+untrusted DSL replay.

op-reth state wipe: new stack.OnDiskStateWiper interface, implemented on *OpReth (RemoveAll data + proof-history dirs, re-run op-reth init + op-reth proofs init via a shared initStorage()). No-op on in-memory op-geth. Called between Stop and Start by Supernode.RestartWithFreshDataDirAndELs.

-count=5 of EL-wiped subtests, locally:

  • op-geth: PostActivation 5/5 (143–155s), PreActivation 5/5 (30–32s)
  • op-reth: PostActivation 5/5 (148–167s), PreActivation 5/5 (36–49s)

@ajsutton ajsutton requested a review from a team as a code owner May 18, 2026 05:00
@wiz-0f98cca50a
Copy link
Copy Markdown

wiz-0f98cca50a Bot commented May 18, 2026

Wiz Scan Summary

Scanner Findings
Vulnerability Finding Vulnerabilities -
Data Finding Sensitive Data -
Secret Finding Secrets -
IaC Misconfiguration IaC Misconfigurations -
SAST Finding SAST Findings 3 Medium
Software Management Finding Software Management Findings -
Total 3 Medium

View scan details in Wiz

To detect these findings earlier in the dev lifecycle, try using Wiz Code VS Code Extension.

Comment thread op-devstack/sysgo/l2_el_opreth.go
Comment thread op-devstack/sysgo/l2_el_opreth.go Outdated
Comment thread op-devstack/sysgo/l2_el_opreth.go Outdated
@ajsutton ajsutton force-pushed the aj/test-interop-startup-rework-acceptance branch from fee2d4b to 74d5599 Compare May 18, 2026 20:15
@ajsutton ajsutton force-pushed the aj/test-interop-startup-rework-cl-el-reset branch from 025b440 to d0bd226 Compare May 18, 2026 20:16
Base automatically changed from aj/test-interop-startup-rework-acceptance to aj/feat/interop-startup-rework May 18, 2026 21:37
@ajsutton ajsutton force-pushed the aj/test-interop-startup-rework-cl-el-reset branch from d0bd226 to 32f1cc7 Compare May 18, 2026 22:35
Comment on lines +165 to +168
s.l2ACL.ConnectPeer(s.seqCLA)
s.l2BCL.ConnectPeer(s.seqCLB)
s.l2ELA.PeerWith(s.seqELA)
s.l2ELB.PeerWith(s.seqELB)
Copy link
Copy Markdown
Contributor

@wwared wwared May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit/question: Is there a reason you don't call connectL2ELPeersBidi here? Or duplicate the calls so we connect the peers both ways?

The tests are working so I guess it's unnecessary, but then does that mean we don't even technically need connectL2ELPeersBidi? Does it make a difference to connect A->B + B->A for either the CL or the EL?

@ajsutton ajsutton force-pushed the aj/feat/interop-startup-rework branch from 01c0632 to aad05dd Compare May 18, 2026 23:03
@ajsutton ajsutton force-pushed the aj/test-interop-startup-rework-cl-el-reset branch from 32f1cc7 to fcece54 Compare May 18, 2026 23:03
@ajsutton ajsutton changed the title test(op-acceptance-tests): interop startup-rework resync tests with EL wipe fix(op-supernode): stabilize first SafeDB snapshot in cold-start + add interop resync ATs May 19, 2026
Base automatically changed from aj/feat/interop-startup-rework to develop May 19, 2026 01:48
ajsutton added 17 commits May 19, 2026 12:14
Replace the EL-finalized-head cold-start heuristic with a deterministic
verifiedDB-resume / SafeDB-first-entry model.

- Resume always wins. Any committed verifiedDB entry resumes at
  LastTimestamp+1 with no SafeDB or chain RPC consultation.
- Cold start (no verifiedDB) waits for every chain to record a first
  SafeDB entry, then sets verificationStartTimestamp =
  max(activationTimestamp, max_c first-safe-head timestamp). Wall-clock
  time is never consulted; chain derivation progress is the only
  authoritative signal relative to activation.
- Backfill lower bound is max(activation, per-chain genesis time,
  verificationStart - depth). Hard fails if any chain cannot serve the
  range. reconcileLogsDBTail runs only during cold-start backfill;
  warm-restart paths rely on DecisionRewind for drift handling.
- Start splits into a fast init plus a stateful main loop. The loop
  drives both cold-start init and progressAndRecord, so Start never
  blocks on multi-day EL sync waits and per-iteration backoff /
  cancellation / observability come for free.
- firstVerifiableTimestamp is now a synchronous accessor backed by
  verifiedDB.FirstTimestamp and verificationStartTimestamp; RPC
  handlers return ErrNotStarted while initialization is in progress.

API surface:
- SafeDBReader.FirstEntry on op-node, exposed through VirtualNode and
  ChainContainer (FirstSafeHeadTimestamp). ChainContainer reports
  ErrSafeDBEmpty during cold-start wait.

Tests:
- New startup_test.go covers fastInit resume / cold-start, the
  advanceColdStartInit state machine, the cold-start backfill no-op
  paths, the per-chain genesis clamp, and ErrNotStarted RPC semantics.
- Deletes log_backfill_test.go and the obsolete EL-finalized-head
  startup tests in interop_test.go — replaced by the above.
…nsient cold-start errors

- Rename fastInit to tryInitFromVerifiedDB.
- Split runLoop so each iteration runs exactly one of waitForColdStartInit
  or progress and the loop is the only place that sleeps; each step returns
  the duration it wants to wait.
- Inject clock.Clock so tests can drive time deterministically; replace
  the remaining time.Sleep/time.Now/time.Since calls.
- Treat all advanceColdStartInit errors as retryable (logged + errorBackoff).
  Cold start races chain-container startup, so transient signals like
  virtual-node-not-running must not kill the activity; cold start has no
  ErrHistoryUnavailable path to handle.
backfillAttempts/backfillCompleted now reference advanceColdStartInit
instead of the removed runLogBackfill, and InteropTestControl's accessor
list swaps the removed BackfillEndTimestamp for VerificationStartTimestamp.
Adds two acceptance tests covering the cold-start resync path introduced
by the interop startup rework: a post-activation case and a
pre-activation case. Both stop the supernode, delete its entire on-disk
data dir, and start a fresh instance against the same chains and
virtual nodes, then assert the verifier resumes at a sane
verificationStartTimestamp and drives forward.

Replaces the prior `RestartInteropActivity(wipeLogsDBs bool)` test
primitive — wiping only the logs DBs is no longer meaningful under the
rework — with `RestartWithFreshDataDir`, which performs a full
supernode Stop + data-dir wipe + Start. The sysgo SuperNode gains a
long-lived tcpproxy so its externally visible RPC URL survives the
restart. The old `TestSupernodeLogBackfill_HappyPath` (and its
backfill/ shared helpers) is removed; its scenario is unreachable under
the rework and the new tests cover the legitimate cold-start path.
Set backfillCompleted in the resume path so the flag reflects "no more
backfill planned" in every branch, and BackfillCompleted() can be a
single atomic read.
…erage

Adds tests for the four cases removed with log_backfill_test.go that are
still load-bearing under the new model — cold-start backfill is the only
path that calls reconcileLogsDBTail, and these exercise its branches:

- MisalignedActivation: floor(activation / blockTime) anchor invariant.
- RecoversFromOfflineReorg: stale tail, reconcile clears, backfill seals.
- LeavesAheadLogsDBUnchanged: tip past endTime + canonical, no-op.
- TrimsNonCanonicalAheadLogsDBAndCatchesUp: tip past endTime + divergent,
  reconcile rewinds then backfill catches up.
Adds two acceptance tests covering the cold-start resync path introduced
by the interop startup rework: a post-activation case and a
pre-activation case. Both stop the supernode, delete its entire on-disk
data dir, and start a fresh instance against the same chains and
virtual nodes, then assert the verifier resumes at a sane
verificationStartTimestamp and drives forward.

Replaces the prior `RestartInteropActivity(wipeLogsDBs bool)` test
primitive — wiping only the logs DBs is no longer meaningful under the
rework — with `RestartWithFreshDataDir`, which performs a full
supernode Stop + data-dir wipe + Start. The sysgo SuperNode gains a
long-lived tcpproxy so its externally visible RPC URL survives the
restart. The old `TestSupernodeLogBackfill_HappyPath` (and its
backfill/ shared helpers) is removed; its scenario is unreachable under
the rework and the new tests cover the legitimate cold-start path.
- Delete startup_resync/testutil.go; constants inlined into each test
  file; cross-safe advance check now composes existing
  L2CLNode.AdvancedFn(types.CrossSafe, ...) via dsl.CheckAll.
- Replace pre-restart history wait with stronger signals: the
  post-activation test waits for the supernode to validate a
  post-activation timestamp; the pre-activation test waits for two
  local-safe block advances so the op-node SafeDB has entries the
  cold-start init can consult.
- Drop redundant pre-restart VerificationStartTimestamp >= activation
  checks and the temporary AwaitCrossSafeAdvancesPast wrapper.
- Trim docstrings on the new DSL helpers and sysgo SuperNode methods.
Setup is materially identical with activation at genesis: the
post-restart cold-start init path is the same, since the data-dir wipe
clears the VN safe_db too and post-restart firstSafeDB sits at the
current chain tip in either case. Saves ~20s of pre-restart wall time.
- Post-activation: replace the timestamp-based wait with AdvancedFn on
  CrossSafe — drops the dependency on RollupConfig().BlockTime and lets
  the test express its intent (the verifier is committing) directly.
- Pre-activation: drop AdvanceTime and the post-activation validation
  step. The pre-activation cold-start behaviour is the whole point of
  the test; the verifier transition at activation is covered elsewhere.
  Bump activationDelay to 300s so the chain tip stays comfortably
  pre-activation under CI variance.
We no longer share devnets across packages, so there's no need to keep
the post-activation and pre-activation cases in separate sub-packages.
Both tests now live in a single startup_resync_test.go file.
- Add presets.WithUniformL2BlockTimes (and the sysgo deployer option it
  wraps) so a single literal sets the block time on every configured
  L2 chain.
- Switch the post-activation test to a 1s block time with a 3s backfill
  depth, and wait for L2 finalized to advance several blocks on each
  chain before the restart. On full data-dir wipe op-node drops the
  derivation pipeline back to the finalized head, not L2 genesis, so
  SafeDB's first entry lands well past activation and the cold-start
  backfill window is non-trivial. Without this the backfill range
  always collapses to empty against a re-recorded genesis SafeDB entry.
- AssertBackfillCovers now verifies the backfill repopulated the logs DB.
- Drop WithTimeTravelEnabled from both tests (no longer needed) and
  inline preActivationDelay as 60*60 seconds so the chain stays
  pre-activation through the whole pre-activation test.
- Rename TestSupernodeResyncResumesAtActivation_PreActivation to
  TestSupernodeResyncSchedulesAtActivation_PreActivation; this test
  only covers the scheduled-but-not-active regime.
Split the supernode test-control restart into separate Stop / StartWithFreshDataDir
phases so callers can interleave wipes of sibling components (e.g. an EL data
dir) between supernode teardown and bring-up. Adds dsl.Supernode.RestartWithFreshDataDirAndELs to orchestrate the combined wipe and a tolerant
AwaitVerificationStartsAtOrAfter assertion for cold-start scenarios where the
exact handoff timestamp depends on when the first SafeDB entry is derived.
…quencer

Adds a two-L2 supernode interop preset where, per chain, a sibling op-node +
op-geth pair acts as the canonical sequencer (driving the batcher and proposer)
and the supernode's virtual node runs a separate non-sequencer op-geth, with
both ELs P2P-peered and both CLs P2P-peered.

This lets tests wipe the supernode data dir together with the supernode-fronted
EL while the sibling pair keeps producing blocks. Block transport into the
wiped EL flows through the supernode VN's CL reqresp from the sibling CL plus
engine-API payload insertion — the in-process devstack op-geth does not run a
real execution-layer sync protocol between sibling instances, so true
ELSync-over-devp2p is not used here.
… EL wipe

Adds the EL-also-wiped variants of the existing post-activation and
pre-activation supernode cold-start tests. Each new test uses the new
TwoL2SupernodeInteropPeerEL preset so a sibling sequencer pair keeps producing
blocks while the supernode and its fronted EL are wiped together, then asserts
cold-start init completes and cross-safe advances after the restart.
The previous version of TwoL2SupernodeInteropPeerEL ran the supernode VN in
CL-sync + reqresp so block content reached the wiped EL through CL gossip.
That side-stepped the test's intent and the supernode's production sync mode.

Rewire so the supernode VN runs as a non-sequencer with Sync.SyncMode = ELSync,
reqresp disabled, and discovery off. Block content into the supernode-fronted
EL now flows over execution-layer devp2p from the sibling sequencer EL; CL
pubsub is only used to announce new heads so the VN can issue forkchoice
updates. Both ELs and both CLs are explicitly peered via admin_addPeer-style
helpers at preset setup, and the tests re-peer them after the supernode + EL
wipe (discovery is off and the wipe drops every peer table).

Supporting changes:
  * supernodeVNMode now exposes NonSequencer / SyncMode / DisableReqRespSync
    / NoDiscovery so the peer-EL preset can drive the supernode VN through
    the regular config path.
  * connectL2ELPeersBidi adds admin_addPeer in both directions for the small
    two-node EL topology.
  * Preset exposes SequencerL2A/B EL+CL so tests can address the sibling
    sequencer directly; the supernode-fronted EL becomes the TwoL2 L2ELA/B
    field (the wipe target).
  * Tests call ConnectPeer (CL pubsub) and PeerWith (EL devp2p) after the
    combined wipe, then assert cold-start init completes and cross-safe
    advances. Engine logs show 'Starting EL sync' and 'Finished EL sync'
    after each restart, confirming genuine ELSync.
Folds the four tests into two parameterised functions with EL-intact and
EL-wiped subtests. The preset choice is hidden inside a resyncSystem helper.

Dedups the peer-EL preset against the single-node preset, mirroring the
TwoL2SupernodeFollowL2 pattern: the shared runtime gains a SupernodeVNMode
toggle and a SkipBatcherProposer flag, and the peer-EL constructor calls
addPeerELSiblingSequencer to attach a sibling sequencer pair under
Followers["sequencer"]. The standalone peer-EL runtime file is removed.

Adds an OnDiskStateWiper interface implemented by op-reth so the EL wipe
path is a no-op on in-memory op-geth and a genuine RemoveAll+init on
op-reth. dsl.L2ELNode.WipeOnDiskState wraps it and is called between Stop
and Start by Supernode.RestartWithFreshDataDirAndELs.
@ajsutton ajsutton force-pushed the aj/test-interop-startup-rework-cl-el-reset branch from 77a0e75 to 24f652c Compare May 19, 2026 02:15
ajsutton added 4 commits May 19, 2026 12:58
…timeout to 60

newResyncSystem was a 4-line wrapper called twice. Inline it. CrossSafe advance to +1 typically resolves within seconds of AwaitBackfillCompleted returning, so 240*2s=8min was orders of magnitude too generous; 60*2s=2min still leaves ample margin.
…rage()

Same data-dir init + proofs-init invocations were inlined at first start in startMixedOpRethNode and re-implemented in WipeOnDiskState. Extract to a private method on *OpReth and call from both.
@ajsutton ajsutton force-pushed the aj/test-interop-startup-rework-cl-el-reset branch from befbee9 to 3b23e1b Compare May 19, 2026 02:58
@ajsutton ajsutton changed the title fix(op-supernode): stabilize first SafeDB snapshot in cold-start + add interop resync ATs test(op-acceptance-tests): interop startup-rework resync tests with EL wipe May 19, 2026
ajsutton added 4 commits May 19, 2026 13:55
…start

Peer-relationship tracking lived in DSL: ConnectPeer/PeerWith pushed entries into a managedPeers map on each side, and Start hooks re-dialed them after restart. Initial wiring was bidi+trusted via sysgo's connectL2ELPeersBidi, but restart-replay was one-way+untrusted via DSL — silent behaviour drift across restart, and every initial pair got dialed three times (sysgo bidi=2 + DSL=1).

Move peer tracking into the sysgo nodes themselves. OpReth, OpGeth, and SuperNode embed a peerRegistry; SuperNodeProxy forwards through to its underlying SuperNode. connectL2ELPeersBidi and connectL2CLPeers register a closure on both participants that re-invokes the same dial; each node replays its registry as the final step of Start / startLocked. SuperNode also records per-chain RPC routes and waits for both optimism_rollupConfig and opp2p_self to answer before replaying connectors.

connectL2CLPeersOnce now refreshes peer info on each retry and tries every advertised address: a just-restarted in-process op-node briefly lists historic listener addresses alongside the current one, and a snapshot of Addresses[0] would target a dead port.

DSL managedPeers / ManagePeer / restoreManagedPeers / AttachFrontends(cls) are gone; the Supernode DSL keeps only frontedELs for the WipeELs path.
…FreshDataDir

The DSL's RestartWithFreshDataDir had two paths depending on whether ELs
were wiped. Inline the stop into the DSL so both paths share a single
flow: Stop -> optionally wipe and restart fronted ELs -> StartWithFreshDataDir.

Drop the now-redundant SupernodeTestControl.RestartWithFreshDataDir, and
rename StopForExternalWipe to Stop since it's the only stop path.
…pe path

A chain has one sequencer; the PeerEL preset now reads as one sequencer
plus one verifier (the supernode-fronted node). Rename the helpers and
types accordingly: addPeerELSiblingSequencer -> addPeerELSequencer,
siblingSequencerFrontend -> sequencerFrontend, sequencer-sibling identity
-> sequencer-peer.

In the DSL, drop the unnecessary frontedELs non-empty assertion and
collapse the WipeELs log branch into a single log line.
…pernode

The PeerEL preset previously reused the base supernode in NonSequencer
mode and added stand-alone sequencer op-nodes, which made the
TwoL2SupernodeInterop embedded fields (Supernode, L2ELA, ...) mean
'verifier side' in this preset but 'sequencer side' everywhere else.

Restructure so the base preset's meaning is preserved: the embedded
Supernode + L2ELA/L2ELB remain the sequencer side, and the PeerEL preset
adds a second supernode in NonSequencer/ELSync/NoDiscovery mode with its
own EL per chain. The two supernodes' VNs are CL-peered and the verifier
ELs are bidi-peered with the sequencer ELs over EL devp2p. The resync
tests now wipe and assert against the verifier side.

Rename AttachWipeableELs to AttachELs to reflect that the attached ELs
are the supernode's full driven set, and call it on both supernodes.
@ajsutton
Copy link
Copy Markdown
Contributor Author

Abandoning this because unless we have a very long L1, whenever we wipe the CL and EL together it will just resync from genesis.

op-node/rollup/sync/start.go:159 (FindL2Heads) has a recovery branch (lines 169–177) for exactly this "EL synced unsafe but safe is genesis" situation, but it's gated:

  if result.Finalized.Hash == cfg.Genesis.L2.Hash &&
     result.Safe.Hash == cfg.Genesis.L2.Hash &&
     result.Unsafe.Number > cfg.Genesis.L2.Number &&
     result.Unsafe.L1Origin.Number > cfg.Genesis.L1.Number + (RecoverMinSeqWindows*cfg.SeqWindowSize) {
         … L2HeadsForELSyncWithOffset(…)
  }

RecoverMinSeqWindows = 14 (start.go:66). With a typical SeqWindowSize of ~3,600, that's > 50,000 L1 blocks of head-room required. In this test L1 is at block 20-something. The condition fails by
orders of magnitude, so L2HeadsForELSyncWithOffset isn't taken and FindL2Heads falls through to the normal walk-back, which honours what the EL reported: LocalSafe = L2 genesis.

@ajsutton ajsutton closed this May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants