test(op-acceptance-tests): interop startup-rework resync tests with EL wipe by ajsutton · Pull Request #20827 · ethereum-optimism/optimism

ajsutton · 2026-05-18T05:00:16Z

Adds tests 2 and 4 from the interop startup-rework acceptance plan, folded into the post- and pre-activation tests as EL data wiped subtests via t.Run. The fix the EL-wiped path was originally written to exercise has been split out to #20855; these tests cover the resync mechanics on their own.

TwoL2SupernodeInteropPeerEL extends TwoL2SupernodeInterop the same way TwoL2SupernodeFollowL2 does. The supernode VN runs in Sync.SyncMode = ELSync, reqresp + discovery disabled; block content reaches the supernode-fronted EL over execution-layer devp2p from a sibling sequencer EL, CL pubsub only carries head announcements.

Peer tracking moved from DSL into sysgo: OpReth, OpGeth, and SuperNode embed a peerRegistry, connectL2ELPeersBidi/connectL2CLPeers register a re-dial closure on both participants, and each node replays its registry as the final step of Start/startLocked. Restart-after-wipe now uses the same dial code as initial setup (bidi + trusted for ELs), instead of the previous one-way+untrusted DSL replay.

op-reth state wipe: new stack.OnDiskStateWiper interface, implemented on *OpReth (RemoveAll data + proof-history dirs, re-run op-reth init + op-reth proofs init via a shared initStorage()). No-op on in-memory op-geth. Called between Stop and Start by Supernode.RestartWithFreshDataDirAndELs.

-count=5 of EL-wiped subtests, locally:

op-geth: PostActivation 5/5 (143–155s), PreActivation 5/5 (30–32s)
op-reth: PostActivation 5/5 (148–167s), PreActivation 5/5 (36–49s)

wiz-0f98cca50a · 2026-05-18T19:01:54Z

Wiz Scan Summary

Scanner	Findings
Vulnerabilities	-
Sensitive Data	-
Secrets	-
IaC Misconfigurations	-
SAST Findings	3
Software Management Findings	-

Total	3

View scan details in Wiz

To detect these findings earlier in the dev lifecycle, try using Wiz Code VS Code Extension.

wwared · 2026-05-18T22:54:03Z

+	s.l2ACL.ConnectPeer(s.seqCLA)
+	s.l2BCL.ConnectPeer(s.seqCLB)
+	s.l2ELA.PeerWith(s.seqELA)
+	s.l2ELB.PeerWith(s.seqELB)


nit/question: Is there a reason you don't call connectL2ELPeersBidi here? Or duplicate the calls so we connect the peers both ways?

The tests are working so I guess it's unnecessary, but then does that mean we don't even technically need connectL2ELPeersBidi? Does it make a difference to connect A->B + B->A for either the CL or the EL?

Replace the EL-finalized-head cold-start heuristic with a deterministic verifiedDB-resume / SafeDB-first-entry model. - Resume always wins. Any committed verifiedDB entry resumes at LastTimestamp+1 with no SafeDB or chain RPC consultation. - Cold start (no verifiedDB) waits for every chain to record a first SafeDB entry, then sets verificationStartTimestamp = max(activationTimestamp, max_c first-safe-head timestamp). Wall-clock time is never consulted; chain derivation progress is the only authoritative signal relative to activation. - Backfill lower bound is max(activation, per-chain genesis time, verificationStart - depth). Hard fails if any chain cannot serve the range. reconcileLogsDBTail runs only during cold-start backfill; warm-restart paths rely on DecisionRewind for drift handling. - Start splits into a fast init plus a stateful main loop. The loop drives both cold-start init and progressAndRecord, so Start never blocks on multi-day EL sync waits and per-iteration backoff / cancellation / observability come for free. - firstVerifiableTimestamp is now a synchronous accessor backed by verifiedDB.FirstTimestamp and verificationStartTimestamp; RPC handlers return ErrNotStarted while initialization is in progress. API surface: - SafeDBReader.FirstEntry on op-node, exposed through VirtualNode and ChainContainer (FirstSafeHeadTimestamp). ChainContainer reports ErrSafeDBEmpty during cold-start wait. Tests: - New startup_test.go covers fastInit resume / cold-start, the advanceColdStartInit state machine, the cold-start backfill no-op paths, the per-chain genesis clamp, and ErrNotStarted RPC semantics. - Deletes log_backfill_test.go and the obsolete EL-finalized-head startup tests in interop_test.go — replaced by the above.

…nsient cold-start errors - Rename fastInit to tryInitFromVerifiedDB. - Split runLoop so each iteration runs exactly one of waitForColdStartInit or progress and the loop is the only place that sleeps; each step returns the duration it wants to wait. - Inject clock.Clock so tests can drive time deterministically; replace the remaining time.Sleep/time.Now/time.Since calls. - Treat all advanceColdStartInit errors as retryable (logged + errorBackoff). Cold start races chain-container startup, so transient signals like virtual-node-not-running must not kill the activity; cold start has no ErrHistoryUnavailable path to handle.

backfillAttempts/backfillCompleted now reference advanceColdStartInit instead of the removed runLogBackfill, and InteropTestControl's accessor list swaps the removed BackfillEndTimestamp for VerificationStartTimestamp.

Adds two acceptance tests covering the cold-start resync path introduced by the interop startup rework: a post-activation case and a pre-activation case. Both stop the supernode, delete its entire on-disk data dir, and start a fresh instance against the same chains and virtual nodes, then assert the verifier resumes at a sane verificationStartTimestamp and drives forward. Replaces the prior `RestartInteropActivity(wipeLogsDBs bool)` test primitive — wiping only the logs DBs is no longer meaningful under the rework — with `RestartWithFreshDataDir`, which performs a full supernode Stop + data-dir wipe + Start. The sysgo SuperNode gains a long-lived tcpproxy so its externally visible RPC URL survives the restart. The old `TestSupernodeLogBackfill_HappyPath` (and its backfill/ shared helpers) is removed; its scenario is unreachable under the rework and the new tests cover the legitimate cold-start path.

Set backfillCompleted in the resume path so the flag reflects "no more backfill planned" in every branch, and BackfillCompleted() can be a single atomic read.

…erage Adds tests for the four cases removed with log_backfill_test.go that are still load-bearing under the new model — cold-start backfill is the only path that calls reconcileLogsDBTail, and these exercise its branches: - MisalignedActivation: floor(activation / blockTime) anchor invariant. - RecoversFromOfflineReorg: stale tail, reconcile clears, backfill seals. - LeavesAheadLogsDBUnchanged: tip past endTime + canonical, no-op. - TrimsNonCanonicalAheadLogsDBAndCatchesUp: tip past endTime + divergent, reconcile rewinds then backfill catches up.

Adds two acceptance tests covering the cold-start resync path introduced by the interop startup rework: a post-activation case and a pre-activation case. Both stop the supernode, delete its entire on-disk data dir, and start a fresh instance against the same chains and virtual nodes, then assert the verifier resumes at a sane verificationStartTimestamp and drives forward. Replaces the prior `RestartInteropActivity(wipeLogsDBs bool)` test primitive — wiping only the logs DBs is no longer meaningful under the rework — with `RestartWithFreshDataDir`, which performs a full supernode Stop + data-dir wipe + Start. The sysgo SuperNode gains a long-lived tcpproxy so its externally visible RPC URL survives the restart. The old `TestSupernodeLogBackfill_HappyPath` (and its backfill/ shared helpers) is removed; its scenario is unreachable under the rework and the new tests cover the legitimate cold-start path.

- Delete startup_resync/testutil.go; constants inlined into each test file; cross-safe advance check now composes existing L2CLNode.AdvancedFn(types.CrossSafe, ...) via dsl.CheckAll. - Replace pre-restart history wait with stronger signals: the post-activation test waits for the supernode to validate a post-activation timestamp; the pre-activation test waits for two local-safe block advances so the op-node SafeDB has entries the cold-start init can consult. - Drop redundant pre-restart VerificationStartTimestamp >= activation checks and the temporary AwaitCrossSafeAdvancesPast wrapper. - Trim docstrings on the new DSL helpers and sysgo SuperNode methods.

Setup is materially identical with activation at genesis: the post-restart cold-start init path is the same, since the data-dir wipe clears the VN safe_db too and post-restart firstSafeDB sits at the current chain tip in either case. Saves ~20s of pre-restart wall time.

- Post-activation: replace the timestamp-based wait with AdvancedFn on CrossSafe — drops the dependency on RollupConfig().BlockTime and lets the test express its intent (the verifier is committing) directly. - Pre-activation: drop AdvanceTime and the post-activation validation step. The pre-activation cold-start behaviour is the whole point of the test; the verifier transition at activation is covered elsewhere. Bump activationDelay to 300s so the chain tip stays comfortably pre-activation under CI variance.

We no longer share devnets across packages, so there's no need to keep the post-activation and pre-activation cases in separate sub-packages. Both tests now live in a single startup_resync_test.go file.

- Add presets.WithUniformL2BlockTimes (and the sysgo deployer option it wraps) so a single literal sets the block time on every configured L2 chain. - Switch the post-activation test to a 1s block time with a 3s backfill depth, and wait for L2 finalized to advance several blocks on each chain before the restart. On full data-dir wipe op-node drops the derivation pipeline back to the finalized head, not L2 genesis, so SafeDB's first entry lands well past activation and the cold-start backfill window is non-trivial. Without this the backfill range always collapses to empty against a re-recorded genesis SafeDB entry. - AssertBackfillCovers now verifies the backfill repopulated the logs DB. - Drop WithTimeTravelEnabled from both tests (no longer needed) and inline preActivationDelay as 60*60 seconds so the chain stays pre-activation through the whole pre-activation test. - Rename TestSupernodeResyncResumesAtActivation_PreActivation to TestSupernodeResyncSchedulesAtActivation_PreActivation; this test only covers the scheduled-but-not-active regime.

Split the supernode test-control restart into separate Stop / StartWithFreshDataDir phases so callers can interleave wipes of sibling components (e.g. an EL data dir) between supernode teardown and bring-up. Adds dsl.Supernode.RestartWithFreshDataDirAndELs to orchestrate the combined wipe and a tolerant AwaitVerificationStartsAtOrAfter assertion for cold-start scenarios where the exact handoff timestamp depends on when the first SafeDB entry is derived.

…quencer Adds a two-L2 supernode interop preset where, per chain, a sibling op-node + op-geth pair acts as the canonical sequencer (driving the batcher and proposer) and the supernode's virtual node runs a separate non-sequencer op-geth, with both ELs P2P-peered and both CLs P2P-peered. This lets tests wipe the supernode data dir together with the supernode-fronted EL while the sibling pair keeps producing blocks. Block transport into the wiped EL flows through the supernode VN's CL reqresp from the sibling CL plus engine-API payload insertion — the in-process devstack op-geth does not run a real execution-layer sync protocol between sibling instances, so true ELSync-over-devp2p is not used here.

… EL wipe Adds the EL-also-wiped variants of the existing post-activation and pre-activation supernode cold-start tests. Each new test uses the new TwoL2SupernodeInteropPeerEL preset so a sibling sequencer pair keeps producing blocks while the supernode and its fronted EL are wiped together, then asserts cold-start init completes and cross-safe advances after the restart.

The previous version of TwoL2SupernodeInteropPeerEL ran the supernode VN in CL-sync + reqresp so block content reached the wiped EL through CL gossip. That side-stepped the test's intent and the supernode's production sync mode. Rewire so the supernode VN runs as a non-sequencer with Sync.SyncMode = ELSync, reqresp disabled, and discovery off. Block content into the supernode-fronted EL now flows over execution-layer devp2p from the sibling sequencer EL; CL pubsub is only used to announce new heads so the VN can issue forkchoice updates. Both ELs and both CLs are explicitly peered via admin_addPeer-style helpers at preset setup, and the tests re-peer them after the supernode + EL wipe (discovery is off and the wipe drops every peer table). Supporting changes: * supernodeVNMode now exposes NonSequencer / SyncMode / DisableReqRespSync / NoDiscovery so the peer-EL preset can drive the supernode VN through the regular config path. * connectL2ELPeersBidi adds admin_addPeer in both directions for the small two-node EL topology. * Preset exposes SequencerL2A/B EL+CL so tests can address the sibling sequencer directly; the supernode-fronted EL becomes the TwoL2 L2ELA/B field (the wipe target). * Tests call ConnectPeer (CL pubsub) and PeerWith (EL devp2p) after the combined wipe, then assert cold-start init completes and cross-safe advances. Engine logs show 'Starting EL sync' and 'Finished EL sync' after each restart, confirming genuine ELSync.

Folds the four tests into two parameterised functions with EL-intact and EL-wiped subtests. The preset choice is hidden inside a resyncSystem helper. Dedups the peer-EL preset against the single-node preset, mirroring the TwoL2SupernodeFollowL2 pattern: the shared runtime gains a SupernodeVNMode toggle and a SkipBatcherProposer flag, and the peer-EL constructor calls addPeerELSiblingSequencer to attach a sibling sequencer pair under Followers["sequencer"]. The standalone peer-EL runtime file is removed. Adds an OnDiskStateWiper interface implemented by op-reth so the EL wipe path is a no-op on in-memory op-geth and a genuine RemoveAll+init on op-reth. dsl.L2ELNode.WipeOnDiskState wraps it and is called between Stop and Start by Supernode.RestartWithFreshDataDirAndELs.

…timeout to 60 newResyncSystem was a 4-line wrapper called twice. Inline it. CrossSafe advance to +1 typically resolves within seconds of AwaitBackfillCompleted returning, so 240*2s=8min was orders of magnitude too generous; 60*2s=2min still leaves ample margin.

…rage() Same data-dir init + proofs-init invocations were inlined at first start in startMixedOpRethNode and re-implemented in WipeOnDiskState. Extract to a private method on *OpReth and call from both.

…start Peer-relationship tracking lived in DSL: ConnectPeer/PeerWith pushed entries into a managedPeers map on each side, and Start hooks re-dialed them after restart. Initial wiring was bidi+trusted via sysgo's connectL2ELPeersBidi, but restart-replay was one-way+untrusted via DSL — silent behaviour drift across restart, and every initial pair got dialed three times (sysgo bidi=2 + DSL=1). Move peer tracking into the sysgo nodes themselves. OpReth, OpGeth, and SuperNode embed a peerRegistry; SuperNodeProxy forwards through to its underlying SuperNode. connectL2ELPeersBidi and connectL2CLPeers register a closure on both participants that re-invokes the same dial; each node replays its registry as the final step of Start / startLocked. SuperNode also records per-chain RPC routes and waits for both optimism_rollupConfig and opp2p_self to answer before replaying connectors. connectL2CLPeersOnce now refreshes peer info on each retry and tries every advertised address: a just-restarted in-process op-node briefly lists historic listener addresses alongside the current one, and a snapshot of Addresses[0] would target a dead port. DSL managedPeers / ManagePeer / restoreManagedPeers / AttachFrontends(cls) are gone; the Supernode DSL keeps only frontedELs for the WipeELs path.

…FreshDataDir The DSL's RestartWithFreshDataDir had two paths depending on whether ELs were wiped. Inline the stop into the DSL so both paths share a single flow: Stop -> optionally wipe and restart fronted ELs -> StartWithFreshDataDir. Drop the now-redundant SupernodeTestControl.RestartWithFreshDataDir, and rename StopForExternalWipe to Stop since it's the only stop path.

…pe path A chain has one sequencer; the PeerEL preset now reads as one sequencer plus one verifier (the supernode-fronted node). Rename the helpers and types accordingly: addPeerELSiblingSequencer -> addPeerELSequencer, siblingSequencerFrontend -> sequencerFrontend, sequencer-sibling identity -> sequencer-peer. In the DSL, drop the unnecessary frontedELs non-empty assertion and collapse the WipeELs log branch into a single log line.

…pernode The PeerEL preset previously reused the base supernode in NonSequencer mode and added stand-alone sequencer op-nodes, which made the TwoL2SupernodeInterop embedded fields (Supernode, L2ELA, ...) mean 'verifier side' in this preset but 'sequencer side' everywhere else. Restructure so the base preset's meaning is preserved: the embedded Supernode + L2ELA/L2ELB remain the sequencer side, and the PeerEL preset adds a second supernode in NonSequencer/ELSync/NoDiscovery mode with its own EL per chain. The two supernodes' VNs are CL-peered and the verifier ELs are bidi-peered with the sequencer ELs over EL devp2p. The resync tests now wipe and assert against the verifier side. Rename AttachWipeableELs to AttachELs to reflect that the attached ELs are the supernode's full driven set, and call it on both supernodes.

ajsutton · 2026-05-19T20:11:40Z

Abandoning this because unless we have a very long L1, whenever we wipe the CL and EL together it will just resync from genesis.

op-node/rollup/sync/start.go:159 (FindL2Heads) has a recovery branch (lines 169–177) for exactly this "EL synced unsafe but safe is genesis" situation, but it's gated:

  if result.Finalized.Hash == cfg.Genesis.L2.Hash &&
     result.Safe.Hash == cfg.Genesis.L2.Hash &&
     result.Unsafe.Number > cfg.Genesis.L2.Number &&
     result.Unsafe.L1Origin.Number > cfg.Genesis.L1.Number + (RecoverMinSeqWindows*cfg.SeqWindowSize) {
         … L2HeadsForELSyncWithOffset(…)
  }

RecoverMinSeqWindows = 14 (start.go:66). With a typical SeqWindowSize of ~3,600, that's > 50,000 L1 blocks of head-room required. In this test L1 is at block 20-something. The condition fails by
orders of magnitude, so L2HeadsForELSyncWithOffset isn't taken and FindL2Heads falls through to the normal walk-back, which honours what the EL reported: LocalSafe = L2 genesis.

ajsutton requested a review from a team as a code owner May 18, 2026 05:00

wiz-0f98cca50a Bot reviewed May 18, 2026

View reviewed changes

Comment thread op-devstack/sysgo/l2_el_opreth.go

Comment thread op-devstack/sysgo/l2_el_opreth.go Outdated

Comment thread op-devstack/sysgo/l2_el_opreth.go Outdated

ajsutton force-pushed the aj/test-interop-startup-rework-acceptance branch from fee2d4b to 74d5599 Compare May 18, 2026 20:15

ajsutton force-pushed the aj/test-interop-startup-rework-cl-el-reset branch from 025b440 to d0bd226 Compare May 18, 2026 20:16

Base automatically changed from aj/test-interop-startup-rework-acceptance to aj/feat/interop-startup-rework May 18, 2026 21:37

ajsutton force-pushed the aj/test-interop-startup-rework-cl-el-reset branch from d0bd226 to 32f1cc7 Compare May 18, 2026 22:35

wwared approved these changes May 18, 2026

View reviewed changes

ajsutton force-pushed the aj/feat/interop-startup-rework branch from 01c0632 to aad05dd Compare May 18, 2026 23:03

ajsutton force-pushed the aj/test-interop-startup-rework-cl-el-reset branch from 32f1cc7 to fcece54 Compare May 18, 2026 23:03

ajsutton changed the title ~~test(op-acceptance-tests): interop startup-rework resync tests with EL wipe~~ fix(op-supernode): stabilize first SafeDB snapshot in cold-start + add interop resync ATs May 19, 2026

Base automatically changed from aj/feat/interop-startup-rework to develop May 19, 2026 01:48

ajsutton added 17 commits May 19, 2026 12:14

fix(op-supernode/interop): drop unsynchronized waitingForSync read

0bcdc40

Set backfillCompleted in the resume path so the flag reflects "no more backfill planned" in every branch, and BackfillCompleted() can be a single atomic read.

test: collapse startup-resync tests into a single package

6e44756

We no longer share devnets across packages, so there's no need to keep the post-activation and pre-activation cases in separate sub-packages. Both tests now live in a single startup_resync_test.go file.

ajsutton force-pushed the aj/test-interop-startup-rework-cl-el-reset branch from 77a0e75 to 24f652c Compare May 19, 2026 02:15

ajsutton added 4 commits May 19, 2026 12:58

test(op-acceptance-tests): inline single-use preActivationDelay

3537368

test(op-acceptance-tests): remove trailing blank line (goimports)

8aa4bcc

refactor(op-devstack/sysgo): dedupe op-reth storage init into initSto…

3b23e1b

…rage() Same data-dir init + proofs-init invocations were inlined at first start in startMixedOpRethNode and re-implemented in WipeOnDiskState. Extract to a private method on *OpReth and call from both.

ajsutton force-pushed the aj/test-interop-startup-rework-cl-el-reset branch from befbee9 to 3b23e1b Compare May 19, 2026 02:58

ajsutton changed the title ~~fix(op-supernode): stabilize first SafeDB snapshot in cold-start + add interop resync ATs~~ test(op-acceptance-tests): interop startup-rework resync tests with EL wipe May 19, 2026

ajsutton added 4 commits May 19, 2026 13:55

ajsutton closed this May 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(op-acceptance-tests): interop startup-rework resync tests with EL wipe#20827

test(op-acceptance-tests): interop startup-rework resync tests with EL wipe#20827
ajsutton wants to merge 25 commits into
developfrom
aj/test-interop-startup-rework-cl-el-reset

ajsutton commented May 18, 2026 •

edited

Loading

Uh oh!

wiz-0f98cca50a Bot commented May 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wwared May 18, 2026 •

edited

Loading

Uh oh!

ajsutton commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ajsutton commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wiz-0f98cca50a Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Wiz Scan Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wwared May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ajsutton commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ajsutton commented May 18, 2026 •

edited

Loading

wiz-0f98cca50a Bot commented May 18, 2026 •

edited

Loading

wwared May 18, 2026 •

edited

Loading