
OP-Conductor health check ignores the OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL configuration option #14585

Closed
brg8 opened this issue Feb 28, 2025 · 14 comments · Fixed by #14655
Labels
T-platforms Team: Platforms

Comments

@brg8

brg8 commented Feb 28, 2025

Bug Description

The op-conductor health check ignores the OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL configuration option.

Steps to Reproduce

To exacerbate the problem for reproducibility's sake, start a local devnet with a 1-second block time and OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL set to 60 seconds. Give the batcher a max channel duration of 1.5 hours.

In the conductor logs you will find:

t=2025-02-27T22:12:39+0000 lvl=error msg="unsafe head is not progressing as expected" now=1740694359 unsafe_head_num=522782 last_seen_unsafe_num=522782 last_seen_unsafe_time=1740694357 unsafe_interval=60 time_diff=2 block_diff=0 expected_blocks=1

And then the conductor will select a new leader. You will notice that op-conductor fails over to new sequencers alarmingly often.

Expected behavior

There should be no sequencer failover.

In the example above, op-conductor evaluates that it expected 1 block and got 0, so it picks a new leader. This ignores the OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL option, which should allow the sequencer to be up to 60 seconds behind.

Environment Information:

  • Operating System: Ubuntu 20.04.6 LTS (Focal Fossa)
  • Package Version (or commit hash): op-node v1.10.2, op-conductor v0.2.0

Configurations:

OP_CONDUCTOR_HEALTHCHECK_INTERVAL: "1"
OP_CONDUCTOR_HEALTHCHECK_MIN_PEER_COUNT: "1"
OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL: "60"

Logs:

t=2025-02-27T22:12:39+0000 lvl=error msg="unsafe head is not progressing as expected" now=1740694359 unsafe_head_num=522782 last_seen_unsafe_num=522782 last_seen_unsafe_time=1740694357 unsafe_interval=60 time_diff=2 block_diff=0 expected_blocks=1

Additional context

This PR introduced the bug in the health check.

@alfonso-op alfonso-op added the T-platforms Team: Platforms label Feb 28, 2025
@0x00101010
Collaborator

0x00101010 commented Feb 28, 2025

This isn't a bug; it is a feature of the health check. See the code here:

// healthCheck checks the health of the sequencer by 4 criteria:
// 1. unsafe head is progressing per block time
// 2. unsafe head is not too far behind now (measured by unsafeInterval)
// 3. safe head is progressing every configured batch submission interval
// 4. peer count is above the configured minimum

The log clearly indicates that it violates the first criterion: after 2s (time_diff), the block number still hasn't increased, so the sequencer was seen as unhealthy.
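
To make that first criterion concrete, here is a minimal sketch of the kind of check being applied (illustrative only, not the actual op-conductor code; the names mirror the fields in the log line above, and the exact expected-blocks formula is an assumption chosen to match the logged values):

// unsafeProgressionHealthy sketches criterion 1: did the unsafe head advance
// by as many blocks as the elapsed time predicts?
// In the log above: time_diff=2, block_diff=0, block time 1s, so
// expectedBlocks works out to 1, blockDiff is 0, and the check fails.
func unsafeProgressionHealthy(now, lastSeenUnsafeTime, unsafeHeadNum, lastSeenUnsafeNum, blockTime int64) bool {
    timeDiff := now - lastSeenUnsafeTime           // time_diff in the log
    blockDiff := unsafeHeadNum - lastSeenUnsafeNum // block_diff in the log
    expectedBlocks := timeDiff/blockTime - 1       // assumed formula; the real code may differ
    return blockDiff >= expectedBlocks
}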

@0x00101010
Collaborator

0x00101010 commented Feb 28, 2025

My current thinking is that there are 2 ways we could solve this:

  1. Increase your health check interval. Instead of requiring a block to be produced within 2s, if you change the interval to 5s it would require 4 blocks to be produced within 5s, which might give your sequencer time to catch up.
  2. Remove this health check criterion and only use the unsafe head check to determine sequencer health.

Both have their benefits and drawbacks. Curious about your thoughts? Also cc: @zhwrd

My thought is that we could probably go with option 1, with a recommendation that if you want your unsafe_interval to be 60s, you set your health check interval to 60s (or comparable) as well.
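
Concretely, that recommendation would look something like this (illustrative values, in the same form as the configuration block in the issue description):

OP_CONDUCTOR_HEALTHCHECK_INTERVAL: "60"
OP_CONDUCTOR_HEALTHCHECK_MIN_PEER_COUNT: "1"
OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL: "60"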

@zhwrd
Contributor

zhwrd commented Feb 28, 2025

I can see why it's a bit confusing. Maybe we need a new config for UNSAFE_PROGRESSION_INTERVAL or something that lets you override the time_diff check, since I don't think increasing the healthCheck interval is a great solution (it would mean you are slower to react to op-node/op-geth outages).

@0x00101010
Collaborator

0x00101010 commented Feb 28, 2025

I can see why it's a bit confusing. Maybe we need a new config for UNSAFE_PROGRESSION_INTERVAL or something that lets you override the time_diff check, since I don't think increasing the healthCheck interval is a great solution (it would mean you are slower to react to op-node/op-geth outages).

The thing is that, if you're already allowing the unsafe interval to be 60s, there's no point in reacting faster, right? (There are other safe / peer checks too, but the most important check here is that the unsafe head is progressing within our tolerance.)

Also, if we think a lower unsafe interval is important, I'd suggest that we remove check criterion 1 altogether; after giving it some thought, I feel that it provides not much additional benefit compared to the pure unsafe interval check.
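
For clarity, with criterion 1 removed the unsafe-head portion of the health check reduces to something like the following (again a sketch with illustrative names, not the real implementation):

// unsafeHeadHealthy sketches the remaining unsafe-head condition: the
// sequencer counts as healthy as long as its latest unsafe head is no older
// than the configured unsafe interval (e.g. 60s), regardless of whether each
// individual block landed within one block time.
func unsafeHeadHealthy(now, unsafeHeadTime, unsafeInterval int64) bool {
    return now-unsafeHeadTime <= unsafeInterval
}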

@brg8
Author

brg8 commented Mar 1, 2025

I'd suggest that we remove check criterion 1 altogether; after giving it some thought, I feel that it provides not much additional benefit compared to the pure unsafe interval check.

Yes, this is my thought as well. In the PR I linked, it seems that check (1) was added as an afterthought in a commit titled "fix monitor bugs". I don't think this particular addition was tested and discussed thoroughly.

The thing is that, if you're already allowing the unsafe interval to be 60s, there's no point in reacting faster, right?

Raising the health check interval would mean we don't fail over promptly on other health problems, e.g. if the node is totally unreachable. To be clear: we are not using 60s in production; that was just to ensure the reproducibility of the bug.

Maybe we need a new config for UNSAFE_PROGRESSION_INTERVAL or something that lets you override the time_diff check

Isn't that the same thing as OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL * BLOCK_TIME?

All of that aside, there is a real problem here. Every time a new batch is submitted, the active sequencer halts unsafe head production in order to perform derivation. If this derivation takes more than the block time (usually 2s), then a new leader is elected. This is pointless, since all the sequencers are deriving at the same time and none of them is any more capable of progressing the head than the others. With check (1) in place there is a new leader election on every batch submission! While I trust op-conductor to do leader failovers well, there is no way that this is the intended behavior of the software. There is absolutely a bug here and we should discuss further.

@0x00101010
Collaborator

All of that aside, there is a real problem here. Every time a new batch is submitted, the active sequencer halts unsafe head production in order to perform derivation. If this derivation takes more than the block time (usually 2s), then a new leader is elected.

Do you have any data that indicates it's due to batch processing time? From our experience at Base, batch processing is pretty quick (all in-memory operations that do not require reading/writing the trie DB). I suspect the delay is more likely due to your database / disk configuration.

@brg8
Author

brg8 commented Mar 2, 2025

Do you have any data that indicates it's due to batch processing time?

Good question, appreciate you digging into that. Here is the timeline of events:

17:12:37 Batch submitted as usual, containing roughly 5,400 blocks (this is a 1.5 hour max channel duration with a 1 second block time).
17:12:37 Sequencer fetches the batch and starts to createNextAttributes on the batch.
17:12:39 Conductor for the active sequencer logs "unsafe head is not progressing as expected" and fails over.
17:12:44 Sequencer finishes creating attributes and logs "sequencer paused until new events".

During the 7 seconds between 17:12:37 and 17:12:44 we see 5,400 log lines of "generated attributes in payload queue". This is logged from createNextAttributes, which intentionally blocks sequencing while deriving the batch.

TL;DR: the batch takes 7 seconds to derive, which is greater than the block time, so we hit this problem.

I suspect the delay is more likely due to your database / disk configuration.

This is quite possible. One major difference is that I believe Base is running op-reth? We are running op-geth, which IIRC is less efficient on DB ops. Another difference is their 2s block time vs. our 1s.

However, I suspect the reason you don't see this on Base is that you never submit a batch at the max channel duration, or you have the duration set to a lower value so that this wouldn't be a problem anyway. In the case of this chain there is no actual usage yet, so we set the durations pretty high in order to save on unnecessary L1 posting costs.

I'd suggest that we remove check criterion 1 altogether; after giving it some thought, I feel that it provides not much additional benefit compared to the pure unsafe interval check.

Anything against doing this? We currently have to disable conductor on low-throughput chains. You can keep your desired behavior by setting OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL equal to your block time (2s), but there is no way for us to get our desired behavior with check (1) in place.
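
For illustration, a chain that wants to keep today's strict behavior once check (1) is removed could simply set (assuming a 2s block time; the value is an example, not a recommendation):

OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL: "2"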

@zhwrd
Contributor

zhwrd commented Mar 3, 2025

TL;DR: the batch takes 7 seconds to derive, which is greater than the block time, so we hit this problem.

It's likely you need to tune your L1 cache settings on op-node for your low throughput chains. See these release notes for details.

Regardless, I tend to agree that for performance issues that are block-specific (geth in particular can have long-tail latencies on block building), we don't want to be needlessly changing leaders because one block exceeded the block time. So I'd be in favour of removing the first check and tuning OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL to whatever you feel is a reasonable tradeoff.

@0x00101010
Collaborator

Seems like we've got an agreement. @brg8, do you mind creating a PR to remove that check?

@brg8
Author

brg8 commented Mar 3, 2025

Yep I can do that 👍

@dpulitano
Contributor

Hello! I took a stab at this.

I tested with an updated mock unit test which covers this. Happy to also test this manually, but it'd save some time if a Docker image for this branch could be published with Optimism's CI/CD. Is that possible to do pre-merge? Thanks!

@zhwrd
Contributor

zhwrd commented Mar 6, 2025

Have released this change in op-conductor/v0.3.2-rc.1; we'll do some internal testing and then finalize the release.

@dpulitano you should be able to pull the new image now, let us know how testing goes on your end.

@dpulitano
Contributor

Thanks, I saw! Deployed it for an internal chain and it's looking alright. Will test on some more chains soon.

I noticed there's a breaking change for us in this version, by the way, where we can no longer specify the private IP address of op-conductor for OP_CONDUCTOR_CONSENSUS_ADDR. I had to update it to be 0.0.0.0 and then set OP_CONDUCTOR_CONSENSUS_ADVERTISED to <private ip>:<consensus port>, otherwise I receive:

t=2025-03-06T18:26:01+0000 lvl=crit msg="Application failed" message="failed to setup: failed to create conductor: failed to initialize consensus: failed to create raft consensus: failed to create raft tcp transport: listen tcp <redacted OP_CONDUCTOR_CONSENSUS_ADDR IP>:50050: bind: cannot assign requested address"

Do you know if this was intentional or a regression? I can open another issue if it doesn't seem intentional. The OP Conductor guide still shows using the private IP for OP_CONDUCTOR_CONSENSUS_ADDR; we should update that to reflect the required configuration if this was indeed intentional.
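
For reference, the adjusted settings described above look roughly like this (placeholders kept as written; exact port handling may differ):

OP_CONDUCTOR_CONSENSUS_ADDR: "0.0.0.0"
OP_CONDUCTOR_CONSENSUS_ADVERTISED: "<private ip>:<consensus port>"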

@zhwrd
Contributor

zhwrd commented Mar 6, 2025

Ah yeah, that was an intentional change in 0.3 I think; will get the docs updated.
