fix(op-conductor): tolerate interop reorg recovery health by wwared · Pull Request #20817 · ethereum-optimism/optimism

wwared · 2026-05-15T22:11:22Z

Summary

This PR makes conductor health checks tolerate transient recovery conditions without masking sustained sequencer failures, and extends devstack so conductor-backed supernode interop scenarios can be tested end to end.

The conductor health monitor now evaluates three CL-side conditions from first principles:

Sync-status RPC availability: temporary optimism_syncStatus failures are tolerated inside a shared rolling window; a full window of failures reports ErrSequencerConnectionDown.
Unsafe-head progress: a sequencer is healthy while unsafe-head lag is within the configured interval, or while lag is actively shrinking during recovery. If recovery stops improving for a full window, or remains above the unhealthy ceiling after the recovery window, the sequencer reports unhealthy.
CL peer count: low peer count and peer-stat RPC failures use the same rolling-window behavior, so transient peer churn does not immediately trip conductor health, but sustained failure still does.
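The rolling-window behavior described above can be sketched roughly as follows. This is a minimal illustration, not the op-conductor implementation: `graceWindow`, `newGraceWindow`, `observe`, and `checkSyncStatus` are hypothetical names, the window size of 3 is arbitrary, and "a full window of failures" is assumed to mean that every poll in the window failed. Only the error name `ErrSequencerConnectionDown` comes from the PR description.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSequencerConnectionDown mirrors the error name in the PR description.
// Everything else in this file is a hypothetical sketch.
var ErrSequencerConnectionDown = errors.New("sequencer connection down")

// graceWindow remembers the outcome of the most recent `size` polls.
type graceWindow struct {
	size    int
	results []bool // true means the poll failed
}

func newGraceWindow(size int) *graceWindow {
	return &graceWindow{size: size}
}

// observe records one poll and reports whether the window is full
// and every poll in it failed.
func (w *graceWindow) observe(failed bool) bool {
	w.results = append(w.results, failed)
	if len(w.results) > w.size {
		w.results = w.results[1:] // drop the oldest poll
	}
	if len(w.results) < w.size {
		return false // not enough history yet, stay tolerant
	}
	for _, f := range w.results {
		if !f {
			return false // at least one success inside the window
		}
	}
	return true
}

// checkSyncStatus tolerates transient RPC errors and only reports
// ErrSequencerConnectionDown after a full window of failures.
func checkSyncStatus(w *graceWindow, rpcErr error) error {
	if w.observe(rpcErr != nil) {
		return ErrSequencerConnectionDown
	}
	return nil
}

func main() {
	w := newGraceWindow(3)
	rpcErr := errors.New("timeout")
	// Two failures, one success, then three failures in a row.
	polls := []error{rpcErr, rpcErr, nil, rpcErr, rpcErr, rpcErr}
	for i, e := range polls {
		fmt.Printf("poll %d: health err = %v\n", i+1, checkSyncStatus(w, e))
	}
}
```

In this sketch only the sixth poll reports the error, because the success at poll 3 stays inside the window until then. The same structure could back the peer-count check, with "failed" meaning a low peer count or a peer-stat RPC error.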

Existing safe-head, EL P2P, and rollup-boost health checks remain part of the overall health decision.

Closes #20006

This PR does not change the fact that once a sequencer is stopped, it never becomes healthy again on its own and requires manual operator intervention to restart. In particular, an op-node that restarts with sequencerEnabled=true and sequencerStopped=true will not be reactivated by the conductor. This matches current behavior.

Additional follow-up work: #20854 would make the errors returned during rewind/reorg more consistent.

Devstack changes

NewMinimalWithConductors now applies configurable conductor health-check settings, connects conductor op-node peers, waits for the sequencer op-node to become active, and seeds the conductor with the current unsafe payload after leader election.
Added NewTwoL2SupernodeInteropWithConductors, which runs the two-L2 shared-supernode interop preset with one conductor-controlled sequencer per L2 chain.
Supernode virtual nodes can now be wired to conductor RPC endpoints before the conductor starts; the endpoint is resolved once the conductor service is available.
The conductor-backed supernode runtime creates exactly MinPeerCount CL health peers per L2 chain, named conductor-health-peer-1, conductor-health-peer-2, etc.
Added devstack-only preset options for conductor health checks, including WithConductorHealthCheckMinPeerCount. The default remains MinPeerCount == 1.
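As a small illustration of the health-peer naming scheme described above, the sketch below generates one name per required peer. `healthPeerNames` is a hypothetical helper written for this example and is not taken from the devstack code; only the `conductor-health-peer-N` naming pattern and the MinPeerCount parameter come from the PR description.

```go
package main

import "fmt"

// healthPeerNames returns exactly minPeerCount names following the
// conductor-health-peer-N pattern, numbered from 1.
// Hypothetical helper, for illustration only.
func healthPeerNames(minPeerCount int) []string {
	if minPeerCount < 0 {
		minPeerCount = 0 // treat negative counts as "no peers"
	}
	names := make([]string, 0, minPeerCount)
	for i := 1; i <= minPeerCount; i++ {
		names = append(names, fmt.Sprintf("conductor-health-peer-%d", i))
	}
	return names
}

func main() {
	// With MinPeerCount=2, as used by the supernode interop acceptance test.
	fmt.Println(healthPeerNames(2))
}
```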

Tests

Added conductor health monitor tests for unsafe-head recovery, stopped sequencers, sync-status RPC grace windows, and peer-count grace windows.
Added an acceptance test that stops a conductor-managed sequencer and verifies conductor health turns unhealthy.
Added a supernode interop acceptance test that creates an invalid-message reorg while watching conductor health, using MinPeerCount=2 to prove the devstack starts enough health peers.
Added preset option tests for conductor health-check min-peer configuration and option ordering.

Verification

Run locally:

mise exec -- go test ./op-conductor/conductor ./op-conductor/health -count=1
mise exec -- go test ./op-devstack/sysgo ./op-devstack/presets -count=1
mise exec -- env RUST_JIT_BUILD=1 go test ./op-acceptance-tests/tests/supernode/interop/reorg -run TestSupernodeInteropInvalidMessageReorgKeepsConductorHealthy -count=1 -timeout 5m
git diff --check

jelias2

The recovery state machine introduced in this PR won't survive interop reorgs of any meaningful depth. Here's why:

The ceiling check on line 351 of monitor.go fires after recoveringWindowSize (5) polls and kills health if curUnsafeLag > 3 * unsafeInterval. In production, unsafeInterval=2s, so the ceiling is 6 seconds (3 blocks). Any interop reorg deeper than that triggers an unhealthy verdict even if the sequencer is actively rebuilding the chain.

The ceiling check also fires before the shrinking-lag check, so it doesn't matter if the sequencer is making progress. Once pollsInRecovery hits 5 and the lag is above 6 seconds, it's dead.
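The ordering problem can be reproduced with a small sketch. This is a reconstruction from the description in this comment, not the code in monitor.go: `healthyDuringRecovery` is a hypothetical function, lag is expressed in whole seconds, and the starting lag of 120 seconds with 10 seconds recovered per poll is an invented scenario.

```go
package main

import "fmt"

const (
	recoveringWindowSize = 5 // polls before the ceiling applies (from the review comment)
	unsafeInterval       = 2 // seconds, the production value quoted above
)

// healthyDuringRecovery is a hypothetical reconstruction of the ordering
// described in the review: the ceiling check runs before the shrinking-lag
// check, so progress is ignored once the window has elapsed.
func healthyDuringRecovery(pollsInRecovery int, prevLag, curLag uint64) bool {
	// Ceiling check first: unhealthy after the window if lag exceeds 3 * unsafeInterval.
	if pollsInRecovery >= recoveringWindowSize && curLag > 3*unsafeInterval {
		return false
	}
	// Shrinking-lag check: healthy only while the lag keeps decreasing.
	return curLag < prevLag
}

func main() {
	// Assumed scenario: the sequencer starts 120s behind and gains 10s per poll.
	lag := uint64(120)
	for poll := 1; poll <= 6; poll++ {
		next := lag - 10
		fmt.Printf("poll %d: lag %ds -> %ds healthy=%v\n",
			poll, lag, next, healthyDuringRecovery(poll, lag, next))
		lag = next
	}
}
```

Under these assumptions polls 1 through 4 report healthy because the lag is shrinking, and poll 5 reports unhealthy even though the lag is still shrinking at the same rate.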

For interop reorgs, the chain rewinds to the parent of the invalid block, which could be minutes behind if cross-safe hasn't advanced recently. The sequencer then rebuilds blocks as fast as the engine allows, but it can't close a multi-minute gap down to 6 seconds within 5 polls. The conductor declares the sequencer unhealthy and triggers a leader transfer, which is exactly the false positive this PR is trying to prevent.

Why "just increase the lag window" doesn't work

unsafeInterval controls both when recovery mode is entered (steady state detection) and the ceiling during recovery (3 * unsafeInterval). If you increase it to e.g. 10 minutes to tolerate deep reorgs, you also make the conductor blind to a genuinely dead sequencer for 10 minutes before it even starts noticing. That defeats the purpose of the conductor.
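To make the coupling concrete, the sketch below prints both quantities for a few settings. `recoveryCeiling` is a hypothetical helper that simply encodes the 3 * unsafeInterval relationship described above; the 30-second value is an arbitrary intermediate example, and the "unnoticed for up to" figure restates the claim in this comment rather than a measured result.

```go
package main

import (
	"fmt"
	"time"
)

// recoveryCeiling encodes the relationship described in the review:
// the ceiling during recovery is 3 * unsafeInterval.
// Hypothetical helper, for illustration only.
func recoveryCeiling(unsafeInterval time.Duration) time.Duration {
	return 3 * unsafeInterval
}

func main() {
	intervals := []time.Duration{2 * time.Second, 30 * time.Second, 10 * time.Minute}
	for _, iv := range intervals {
		// The same parameter sets both the detection threshold and the ceiling,
		// so raising one necessarily raises the other.
		fmt.Printf("unsafeInterval=%v: stalled sequencer unnoticed for up to %v, recovery ceiling=%v\n",
			iv, iv, recoveryCeiling(iv))
	}
}
```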

wwared force-pushed the fix/conductor-interop-reorg-recovery branch 3 times, most recently from 0425fec to d13f29f on May 19, 2026 03:53

wwared marked this pull request as ready for review May 19, 2026 13:13

wwared requested a review from a team as a code owner May 19, 2026 13:13

wwared added 9 commits May 19, 2026 14:43

test(op-acceptance-tests): cover conductor reorg health

dcf2eea

fix(op-conductor): tolerate transient unsafe lag recovery

a48c530

fix(op-conductor): tolerate transient sync status gaps

bedcc7b

fix(op-conductor): require rolling unsafe lag recovery

7200b12

fix(op-conductor): restore peer count validation

55b5f6d

fix(op-conductor): add rolling health grace windows

bb50cb4

fix(op-devstack): wait for conductor sequencer startup

562d018

fix(op-devstack): match conductor health peers to min peer count

9a1e4bf

test(op-conductor): remove low-value health checks

7ac1628

wwared force-pushed the fix/conductor-interop-reorg-recovery branch from 1194b3b to 7ac1628 on May 19, 2026 14:44

jelias2 reviewed May 19, 2026

wwared added 3 commits May 19, 2026 17:22

fix(op-conductor): use dynamic unsafe recovery progress

eaea2c4

fix(devstack): use light CL sequencers for conductor supernode

06a5598

feat(op-conductor): add health debug metrics

ae475f8

fix(op-conductor): tolerate interop reorg recovery health #20817
wwared wants to merge 12 commits into develop from fix/conductor-interop-reorg-recovery
