
fix(op-conductor): tolerate interop reorg recovery health#20817

Open
wwared wants to merge 12 commits into develop from fix/conductor-interop-reorg-recovery



@wwared commented May 15, 2026

Summary

This PR makes conductor health checks tolerate transient recovery conditions without masking sustained sequencer failures, and extends devstack so conductor-backed supernode interop scenarios can be tested end to end.

The conductor health monitor now evaluates three CL-side conditions from first principles:

  • Sync-status RPC availability: temporary optimism_syncStatus failures are tolerated inside a shared rolling window; a full window of failures reports ErrSequencerConnectionDown.
  • Unsafe-head progress: a sequencer is healthy while unsafe-head lag is within the configured interval, or while lag is actively shrinking during recovery. If recovery stops improving for a full window, or remains above the unhealthy ceiling after the recovery window, the sequencer reports unhealthy.
  • CL peer count: low peer count and peer-stat RPC failures use the same rolling-window behavior, so transient peer churn does not immediately trip conductor health, but sustained failure still does.

Existing safe-head, EL P2P, and rollup-boost health checks remain part of the overall health decision.

Closes #20006

This PR does not change the fact that a stopped sequencer never becomes healthy again on its own; an operator must restart it manually. In particular, an op-node that restarts with sequencerEnabled=true and sequencerStopped=true will not be reactivated by the conductor. This matches current behavior.

Additional follow-up work: #20854 would make the errors returned during rewind/reorg more consistent.

Devstack changes

  • NewMinimalWithConductors now applies configurable conductor health-check settings, connects conductor op-node peers, waits for the sequencer op-node to become active, and seeds the conductor with the current unsafe payload after leader election.
  • Added NewTwoL2SupernodeInteropWithConductors, which runs the two-L2 shared-supernode interop preset with one conductor-controlled sequencer per L2 chain.
  • Supernode virtual nodes can now be wired to conductor RPC endpoints before the conductor starts; the endpoint is resolved once the conductor service is available.
  • The conductor-backed supernode runtime creates exactly MinPeerCount CL health peers per L2 chain, named conductor-health-peer-1, conductor-health-peer-2, etc.
  • Added devstack-only preset options for conductor health checks, including WithConductorHealthCheckMinPeerCount. The default remains MinPeerCount == 1.
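As a rough illustration of how a min-peer option could drive the number and naming of health peers: the sketch below is not the devstack API. `healthCheckConfig`, `withMinPeerCount`, `newHealthCheckConfig`, and `healthPeerNames` are hypothetical names, and the "last option wins" ordering is an assumption. Only the default of 1 and the `conductor-health-peer-N` naming come from the description above.

```go
package main

import "fmt"

// healthCheckConfig is a hypothetical stand-in for the conductor health settings.
type healthCheckConfig struct {
	MinPeerCount int
}

// option mutates the config; options are applied in the order given.
type option func(*healthCheckConfig)

// withMinPeerCount is a hypothetical analogue of WithConductorHealthCheckMinPeerCount.
func withMinPeerCount(n int) option {
	return func(c *healthCheckConfig) { c.MinPeerCount = n }
}

// newHealthCheckConfig starts from the documented default (MinPeerCount == 1)
// and applies each option in turn, so a later option overrides an earlier one.
func newHealthCheckConfig(opts ...option) healthCheckConfig {
	cfg := healthCheckConfig{MinPeerCount: 1}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

// healthPeerNames returns exactly MinPeerCount names, numbered from 1.
func healthPeerNames(cfg healthCheckConfig) []string {
	var names []string
	for i := 1; i <= cfg.MinPeerCount; i++ {
		names = append(names, fmt.Sprintf("conductor-health-peer-%d", i))
	}
	return names
}

func main() {
	cfg := newHealthCheckConfig(withMinPeerCount(2))
	fmt.Println(healthPeerNames(cfg))
}
```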

Tests

  • Added conductor health monitor tests for unsafe-head recovery, stopped sequencers, sync-status RPC grace windows, and peer-count grace windows.
  • Added an acceptance test that stops a conductor-managed sequencer and verifies conductor health turns unhealthy.
  • Added a supernode interop acceptance test that creates an invalid-message reorg while watching conductor health, using MinPeerCount=2 to prove the devstack starts enough health peers.
  • Added preset option tests for conductor health-check min-peer configuration and option ordering.

Verification

Run locally:

mise exec -- go test ./op-conductor/conductor ./op-conductor/health -count=1
mise exec -- go test ./op-devstack/sysgo ./op-devstack/presets -count=1
mise exec -- env RUST_JIT_BUILD=1 go test ./op-acceptance-tests/tests/supernode/interop/reorg -run TestSupernodeInteropInvalidMessageReorgKeepsConductorHealthy -count=1 -timeout 5m
git diff --check

@wwared wwared force-pushed the fix/conductor-interop-reorg-recovery branch 3 times, most recently from 0425fec to d13f29f Compare May 19, 2026 03:53
@wwared wwared marked this pull request as ready for review May 19, 2026 13:13
@wwared wwared requested a review from a team as a code owner May 19, 2026 13:13
@wwared wwared force-pushed the fix/conductor-interop-reorg-recovery branch from 1194b3b to 7ac1628 Compare May 19, 2026 14:44

@jelias2 left a comment


The recovery state machine introduced in this PR won't survive interop reorgs of any meaningful depth. Here's why:

The ceiling check on line 351 of monitor.go fires after recoveringWindowSize (5) polls and kills health if curUnsafeLag > 3 * unsafeInterval. In production, unsafeInterval=2s, so the ceiling is 6 seconds, or 3 blocks. Any interop reorg deeper than that triggers an unhealthy verdict even if the sequencer is actively rebuilding the chain.

The ceiling check also fires before the shrinking-lag check, so it doesn't matter if the sequencer is making progress. Once pollsInRecovery hits 5 and the lag is above 6 seconds, it's dead.

For interop reorgs, the chain rewinds to the parent of the invalid block, which could be minutes behind if cross-safe hasn't advanced recently. The sequencer then rebuilds blocks as fast as the engine allows, but it can't close a multi-minute gap down to 6 seconds within 5 polls. The conductor declares the sequencer unhealthy and triggers a leader transfer, which is exactly the false positive this PR is trying to prevent.

Why "just increase the lag window" doesn't work

unsafeInterval controls both when recovery mode is entered (steady state detection) and the ceiling during recovery (3 * unsafeInterval). If you increase it to e.g. 10 minutes to tolerate deep reorgs, you also make the conductor blind to a genuinely dead sequencer for 10 minutes before it even starts noticing. That defeats the purpose of the conductor.

