fix(op-conductor): tolerate interop reorg recovery health#20817
fix(op-conductor): tolerate interop reorg recovery health#20817wwared wants to merge 12 commits into
Conversation
0425fec to
d13f29f
Compare
1194b3b to
7ac1628
Compare
jelias2
left a comment
There was a problem hiding this comment.
The recovery state machine introduced in this PR won't survive interop reorgs of any meaningful depth. Here's why:
The ceiling check on line 351 of monitor.go fires after recoveringWindowSize (5) polls and kills health if curUnsafeLag > 3 * unsafeInterval. In production, unsafeInterval=2s, so the ceiling is 6 seconds — 3 blocks. Any interop reorg deeper than that triggers an unhealthy verdict even if the sequencer is actively
rebuilding the chain.
The ceiling check also fires before the shrinking-lag check, so it doesn't matter if the sequencer is making progress. Once pollsInRecovery hits 5 and the lag is above 6 seconds, it's dead.
For interop reorgs, the chain rewinds to the parent of the invalid block — which could be minutes behind if cross-safe hasn't advanced recently. The sequencer then rebuilds blocks as fast as the engine allows, but it can't close a multi-minute gap down to 6 seconds within 5 polls. The conductor declares the
sequencer unhealthy and triggers a leader transfer — exactly the false positive this PR is trying to prevent.
Why "just increase the lag window" doesn't work
unsafeInterval controls both when recovery mode is entered (steady state detection) and the ceiling during recovery (3 * unsafeInterval). If you increase it to e.g. 10 minutes to tolerate deep reorgs, you also make the conductor blind to a genuinely dead sequencer for 10 minutes before it even starts noticing. That
defeats the purpose of the conductor.
Summary
This PR makes conductor health checks tolerate transient recovery conditions without masking sustained sequencer failures, and extends devstack so conductor-backed supernode interop scenarios can be tested end to end.
The conductor health monitor now evaluates three CL-side conditions from first principles:
optimism_syncStatusfailures are tolerated inside a shared rolling window; a full window of failures reportsErrSequencerConnectionDown.Existing safe-head, EL P2P, and rollup-boost health checks remain part of the overall health decision.
Closes #20006
This PR does not change/address the fact that once a sequencer is stopped, it will never become healthy again by itself and requires manual operator intervention to be restarted. This means an op-node restart with
sequencerEnabled=true, sequencerStopped=truewill not be reactivated by the conductor. This matches current behavior.Additional follow-up work: #20854 would make the errors returned during rewind/reorg more consistent.
Devstack changes
NewMinimalWithConductorsnow applies configurable conductor health-check settings, connects conductor op-node peers, waits for the sequencer op-node to become active, and seeds the conductor with the current unsafe payload after leader election.NewTwoL2SupernodeInteropWithConductors, which runs the two-L2 shared-supernode interop preset with one conductor-controlled sequencer per L2 chain.MinPeerCountCL health peers per L2 chain, namedconductor-health-peer-1,conductor-health-peer-2, etc.WithConductorHealthCheckMinPeerCount. The default remainsMinPeerCount == 1.Tests
MinPeerCount=2to prove the devstack starts enough health peers.Verification
Run locally: