test(op-acceptance-tests): interop startup-rework resync tests with EL wipe#20827
test(op-acceptance-tests): interop startup-rework resync tests with EL wipe#20827ajsutton wants to merge 25 commits into
Conversation
Wiz Scan Summary
To detect these findings earlier in the dev lifecycle, try using Wiz Code VS Code Extension. |
fee2d4b to
74d5599
Compare
025b440 to
d0bd226
Compare
d0bd226 to
32f1cc7
Compare
| s.l2ACL.ConnectPeer(s.seqCLA) | ||
| s.l2BCL.ConnectPeer(s.seqCLB) | ||
| s.l2ELA.PeerWith(s.seqELA) | ||
| s.l2ELB.PeerWith(s.seqELB) |
There was a problem hiding this comment.
nit/question: Is there a reason you don't call connectL2ELPeersBidi here? Or duplicate the calls so we connect the peers both ways?
The tests are working so I guess it's unnecessary, but then does that mean we don't even technically need connectL2ELPeersBidi? Does it make a difference to connect A->B + B->A for either the CL or the EL?
01c0632 to
aad05dd
Compare
32f1cc7 to
fcece54
Compare
Replace the EL-finalized-head cold-start heuristic with a deterministic verifiedDB-resume / SafeDB-first-entry model. - Resume always wins. Any committed verifiedDB entry resumes at LastTimestamp+1 with no SafeDB or chain RPC consultation. - Cold start (no verifiedDB) waits for every chain to record a first SafeDB entry, then sets verificationStartTimestamp = max(activationTimestamp, max_c first-safe-head timestamp). Wall-clock time is never consulted; chain derivation progress is the only authoritative signal relative to activation. - Backfill lower bound is max(activation, per-chain genesis time, verificationStart - depth). Hard fails if any chain cannot serve the range. reconcileLogsDBTail runs only during cold-start backfill; warm-restart paths rely on DecisionRewind for drift handling. - Start splits into a fast init plus a stateful main loop. The loop drives both cold-start init and progressAndRecord, so Start never blocks on multi-day EL sync waits and per-iteration backoff / cancellation / observability come for free. - firstVerifiableTimestamp is now a synchronous accessor backed by verifiedDB.FirstTimestamp and verificationStartTimestamp; RPC handlers return ErrNotStarted while initialization is in progress. API surface: - SafeDBReader.FirstEntry on op-node, exposed through VirtualNode and ChainContainer (FirstSafeHeadTimestamp). ChainContainer reports ErrSafeDBEmpty during cold-start wait. Tests: - New startup_test.go covers fastInit resume / cold-start, the advanceColdStartInit state machine, the cold-start backfill no-op paths, the per-chain genesis clamp, and ErrNotStarted RPC semantics. - Deletes log_backfill_test.go and the obsolete EL-finalized-head startup tests in interop_test.go — replaced by the above.
…nsient cold-start errors - Rename fastInit to tryInitFromVerifiedDB. - Split runLoop so each iteration runs exactly one of waitForColdStartInit or progress and the loop is the only place that sleeps; each step returns the duration it wants to wait. - Inject clock.Clock so tests can drive time deterministically; replace the remaining time.Sleep/time.Now/time.Since calls. - Treat all advanceColdStartInit errors as retryable (logged + errorBackoff). Cold start races chain-container startup, so transient signals like virtual-node-not-running must not kill the activity; cold start has no ErrHistoryUnavailable path to handle.
backfillAttempts/backfillCompleted now reference advanceColdStartInit instead of the removed runLogBackfill, and InteropTestControl's accessor list swaps the removed BackfillEndTimestamp for VerificationStartTimestamp.
Adds two acceptance tests covering the cold-start resync path introduced by the interop startup rework: a post-activation case and a pre-activation case. Both stop the supernode, delete its entire on-disk data dir, and start a fresh instance against the same chains and virtual nodes, then assert the verifier resumes at a sane verificationStartTimestamp and drives forward. Replaces the prior `RestartInteropActivity(wipeLogsDBs bool)` test primitive — wiping only the logs DBs is no longer meaningful under the rework — with `RestartWithFreshDataDir`, which performs a full supernode Stop + data-dir wipe + Start. The sysgo SuperNode gains a long-lived tcpproxy so its externally visible RPC URL survives the restart. The old `TestSupernodeLogBackfill_HappyPath` (and its backfill/ shared helpers) is removed; its scenario is unreachable under the rework and the new tests cover the legitimate cold-start path.
Set backfillCompleted in the resume path so the flag reflects "no more backfill planned" in every branch, and BackfillCompleted() can be a single atomic read.
…erage Adds tests for the four cases removed with log_backfill_test.go that are still load-bearing under the new model — cold-start backfill is the only path that calls reconcileLogsDBTail, and these exercise its branches: - MisalignedActivation: floor(activation / blockTime) anchor invariant. - RecoversFromOfflineReorg: stale tail, reconcile clears, backfill seals. - LeavesAheadLogsDBUnchanged: tip past endTime + canonical, no-op. - TrimsNonCanonicalAheadLogsDBAndCatchesUp: tip past endTime + divergent, reconcile rewinds then backfill catches up.
Adds two acceptance tests covering the cold-start resync path introduced by the interop startup rework: a post-activation case and a pre-activation case. Both stop the supernode, delete its entire on-disk data dir, and start a fresh instance against the same chains and virtual nodes, then assert the verifier resumes at a sane verificationStartTimestamp and drives forward. Replaces the prior `RestartInteropActivity(wipeLogsDBs bool)` test primitive — wiping only the logs DBs is no longer meaningful under the rework — with `RestartWithFreshDataDir`, which performs a full supernode Stop + data-dir wipe + Start. The sysgo SuperNode gains a long-lived tcpproxy so its externally visible RPC URL survives the restart. The old `TestSupernodeLogBackfill_HappyPath` (and its backfill/ shared helpers) is removed; its scenario is unreachable under the rework and the new tests cover the legitimate cold-start path.
- Delete startup_resync/testutil.go; constants inlined into each test file; cross-safe advance check now composes existing L2CLNode.AdvancedFn(types.CrossSafe, ...) via dsl.CheckAll. - Replace pre-restart history wait with stronger signals: the post-activation test waits for the supernode to validate a post-activation timestamp; the pre-activation test waits for two local-safe block advances so the op-node SafeDB has entries the cold-start init can consult. - Drop redundant pre-restart VerificationStartTimestamp >= activation checks and the temporary AwaitCrossSafeAdvancesPast wrapper. - Trim docstrings on the new DSL helpers and sysgo SuperNode methods.
Setup is materially identical with activation at genesis: the post-restart cold-start init path is the same, since the data-dir wipe clears the VN safe_db too and post-restart firstSafeDB sits at the current chain tip in either case. Saves ~20s of pre-restart wall time.
- Post-activation: replace the timestamp-based wait with AdvancedFn on CrossSafe — drops the dependency on RollupConfig().BlockTime and lets the test express its intent (the verifier is committing) directly. - Pre-activation: drop AdvanceTime and the post-activation validation step. The pre-activation cold-start behaviour is the whole point of the test; the verifier transition at activation is covered elsewhere. Bump activationDelay to 300s so the chain tip stays comfortably pre-activation under CI variance.
We no longer share devnets across packages, so there's no need to keep the post-activation and pre-activation cases in separate sub-packages. Both tests now live in a single startup_resync_test.go file.
- Add presets.WithUniformL2BlockTimes (and the sysgo deployer option it wraps) so a single literal sets the block time on every configured L2 chain. - Switch the post-activation test to a 1s block time with a 3s backfill depth, and wait for L2 finalized to advance several blocks on each chain before the restart. On full data-dir wipe op-node drops the derivation pipeline back to the finalized head, not L2 genesis, so SafeDB's first entry lands well past activation and the cold-start backfill window is non-trivial. Without this the backfill range always collapses to empty against a re-recorded genesis SafeDB entry. - AssertBackfillCovers now verifies the backfill repopulated the logs DB. - Drop WithTimeTravelEnabled from both tests (no longer needed) and inline preActivationDelay as 60*60 seconds so the chain stays pre-activation through the whole pre-activation test. - Rename TestSupernodeResyncResumesAtActivation_PreActivation to TestSupernodeResyncSchedulesAtActivation_PreActivation; this test only covers the scheduled-but-not-active regime.
Split the supernode test-control restart into separate Stop / StartWithFreshDataDir phases so callers can interleave wipes of sibling components (e.g. an EL data dir) between supernode teardown and bring-up. Adds dsl.Supernode.RestartWithFreshDataDirAndELs to orchestrate the combined wipe and a tolerant AwaitVerificationStartsAtOrAfter assertion for cold-start scenarios where the exact handoff timestamp depends on when the first SafeDB entry is derived.
…quencer Adds a two-L2 supernode interop preset where, per chain, a sibling op-node + op-geth pair acts as the canonical sequencer (driving the batcher and proposer) and the supernode's virtual node runs a separate non-sequencer op-geth, with both ELs P2P-peered and both CLs P2P-peered. This lets tests wipe the supernode data dir together with the supernode-fronted EL while the sibling pair keeps producing blocks. Block transport into the wiped EL flows through the supernode VN's CL reqresp from the sibling CL plus engine-API payload insertion — the in-process devstack op-geth does not run a real execution-layer sync protocol between sibling instances, so true ELSync-over-devp2p is not used here.
… EL wipe Adds the EL-also-wiped variants of the existing post-activation and pre-activation supernode cold-start tests. Each new test uses the new TwoL2SupernodeInteropPeerEL preset so a sibling sequencer pair keeps producing blocks while the supernode and its fronted EL are wiped together, then asserts cold-start init completes and cross-safe advances after the restart.
The previous version of TwoL2SupernodeInteropPeerEL ran the supernode VN in
CL-sync + reqresp so block content reached the wiped EL through CL gossip.
That side-stepped the test's intent and the supernode's production sync mode.
Rewire so the supernode VN runs as a non-sequencer with Sync.SyncMode = ELSync,
reqresp disabled, and discovery off. Block content into the supernode-fronted
EL now flows over execution-layer devp2p from the sibling sequencer EL; CL
pubsub is only used to announce new heads so the VN can issue forkchoice
updates. Both ELs and both CLs are explicitly peered via admin_addPeer-style
helpers at preset setup, and the tests re-peer them after the supernode + EL
wipe (discovery is off and the wipe drops every peer table).
Supporting changes:
* supernodeVNMode now exposes NonSequencer / SyncMode / DisableReqRespSync
/ NoDiscovery so the peer-EL preset can drive the supernode VN through
the regular config path.
* connectL2ELPeersBidi adds admin_addPeer in both directions for the small
two-node EL topology.
* Preset exposes SequencerL2A/B EL+CL so tests can address the sibling
sequencer directly; the supernode-fronted EL becomes the TwoL2 L2ELA/B
field (the wipe target).
* Tests call ConnectPeer (CL pubsub) and PeerWith (EL devp2p) after the
combined wipe, then assert cold-start init completes and cross-safe
advances. Engine logs show 'Starting EL sync' and 'Finished EL sync'
after each restart, confirming genuine ELSync.
Folds the four tests into two parameterised functions with EL-intact and EL-wiped subtests. The preset choice is hidden inside a resyncSystem helper. Dedups the peer-EL preset against the single-node preset, mirroring the TwoL2SupernodeFollowL2 pattern: the shared runtime gains a SupernodeVNMode toggle and a SkipBatcherProposer flag, and the peer-EL constructor calls addPeerELSiblingSequencer to attach a sibling sequencer pair under Followers["sequencer"]. The standalone peer-EL runtime file is removed. Adds an OnDiskStateWiper interface implemented by op-reth so the EL wipe path is a no-op on in-memory op-geth and a genuine RemoveAll+init on op-reth. dsl.L2ELNode.WipeOnDiskState wraps it and is called between Stop and Start by Supernode.RestartWithFreshDataDirAndELs.
77a0e75 to
24f652c
Compare
…timeout to 60 newResyncSystem was a 4-line wrapper called twice. Inline it. CrossSafe advance to +1 typically resolves within seconds of AwaitBackfillCompleted returning, so 240*2s=8min was orders of magnitude too generous; 60*2s=2min still leaves ample margin.
…rage() Same data-dir init + proofs-init invocations were inlined at first start in startMixedOpRethNode and re-implemented in WipeOnDiskState. Extract to a private method on *OpReth and call from both.
befbee9 to
3b23e1b
Compare
…start Peer-relationship tracking lived in DSL: ConnectPeer/PeerWith pushed entries into a managedPeers map on each side, and Start hooks re-dialed them after restart. Initial wiring was bidi+trusted via sysgo's connectL2ELPeersBidi, but restart-replay was one-way+untrusted via DSL — silent behaviour drift across restart, and every initial pair got dialed three times (sysgo bidi=2 + DSL=1). Move peer tracking into the sysgo nodes themselves. OpReth, OpGeth, and SuperNode embed a peerRegistry; SuperNodeProxy forwards through to its underlying SuperNode. connectL2ELPeersBidi and connectL2CLPeers register a closure on both participants that re-invokes the same dial; each node replays its registry as the final step of Start / startLocked. SuperNode also records per-chain RPC routes and waits for both optimism_rollupConfig and opp2p_self to answer before replaying connectors. connectL2CLPeersOnce now refreshes peer info on each retry and tries every advertised address: a just-restarted in-process op-node briefly lists historic listener addresses alongside the current one, and a snapshot of Addresses[0] would target a dead port. DSL managedPeers / ManagePeer / restoreManagedPeers / AttachFrontends(cls) are gone; the Supernode DSL keeps only frontedELs for the WipeELs path.
…FreshDataDir The DSL's RestartWithFreshDataDir had two paths depending on whether ELs were wiped. Inline the stop into the DSL so both paths share a single flow: Stop -> optionally wipe and restart fronted ELs -> StartWithFreshDataDir. Drop the now-redundant SupernodeTestControl.RestartWithFreshDataDir, and rename StopForExternalWipe to Stop since it's the only stop path.
…pe path A chain has one sequencer; the PeerEL preset now reads as one sequencer plus one verifier (the supernode-fronted node). Rename the helpers and types accordingly: addPeerELSiblingSequencer -> addPeerELSequencer, siblingSequencerFrontend -> sequencerFrontend, sequencer-sibling identity -> sequencer-peer. In the DSL, drop the unnecessary frontedELs non-empty assertion and collapse the WipeELs log branch into a single log line.
…pernode The PeerEL preset previously reused the base supernode in NonSequencer mode and added stand-alone sequencer op-nodes, which made the TwoL2SupernodeInterop embedded fields (Supernode, L2ELA, ...) mean 'verifier side' in this preset but 'sequencer side' everywhere else. Restructure so the base preset's meaning is preserved: the embedded Supernode + L2ELA/L2ELB remain the sequencer side, and the PeerEL preset adds a second supernode in NonSequencer/ELSync/NoDiscovery mode with its own EL per chain. The two supernodes' VNs are CL-peered and the verifier ELs are bidi-peered with the sequencer ELs over EL devp2p. The resync tests now wipe and assert against the verifier side. Rename AttachWipeableELs to AttachELs to reflect that the attached ELs are the supernode's full driven set, and call it on both supernodes.
|
Abandoning this because unless we have a very long L1, whenever we wipe the CL and EL together it will just resync from genesis. op-node/rollup/sync/start.go:159 (FindL2Heads) has a recovery branch (lines 169–177) for exactly this "EL synced unsafe but safe is genesis" situation, but it's gated: RecoverMinSeqWindows = 14 (start.go:66). With a typical SeqWindowSize of ~3,600, that's > 50,000 L1 blocks of head-room required. In this test L1 is at block 20-something. The condition fails by |
Adds tests 2 and 4 from the interop startup-rework acceptance plan, folded into the post- and pre-activation tests as
EL data wipedsubtests viat.Run. The fix the EL-wiped path was originally written to exercise has been split out to #20855; these tests cover the resync mechanics on their own.TwoL2SupernodeInteropPeerELextendsTwoL2SupernodeInteropthe same wayTwoL2SupernodeFollowL2does. The supernode VN runs inSync.SyncMode = ELSync, reqresp + discovery disabled; block content reaches the supernode-fronted EL over execution-layer devp2p from a sibling sequencer EL, CL pubsub only carries head announcements.Peer tracking moved from DSL into sysgo:
OpReth,OpGeth, andSuperNodeembed apeerRegistry,connectL2ELPeersBidi/connectL2CLPeersregister a re-dial closure on both participants, and each node replays its registry as the final step ofStart/startLocked. Restart-after-wipe now uses the same dial code as initial setup (bidi + trusted for ELs), instead of the previous one-way+untrusted DSL replay.op-reth state wipe: new
stack.OnDiskStateWiperinterface, implemented on*OpReth(RemoveAlldata + proof-history dirs, re-runop-reth init+op-reth proofs initvia a sharedinitStorage()). No-op on in-memory op-geth. Called between Stop and Start bySupernode.RestartWithFreshDataDirAndELs.-count=5of EL-wiped subtests, locally: