OP-Conductor health check ignores the OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL configuration option
#14585
Comments
This isn't a bug; it's a feature of the health check, see the code here
The log clearly indicates that it violates the first criterion: after 2s (time_diff), the block number still hadn't increased, so the node was seen as unhealthy.
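For readers following along, here is a minimal sketch of the two unsafe-head criteria being discussed: the block number must advance once the time since the last change exceeds the block time, and the unsafe head must not be older than the configured unsafe interval. The names and structure below are assumptions for illustration, not the real op-conductor code.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified sketch of the two unsafe-head criteria discussed in this thread.
// Illustrative only; names and structure are assumptions, not the real
// op-conductor implementation.
type healthChecker struct {
	blockTime      time.Duration // expected unsafe block cadence, e.g. 2s
	unsafeInterval time.Duration // OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL
}

func (h healthChecker) unsafeHeadHealthy(blockNumIncreased bool, timeDiff, unsafeAge time.Duration) bool {
	// Criterion (1): once timeDiff (time since the unsafe head last changed)
	// exceeds the block time, the block number must have increased. This is
	// the check that fires during long derivation pauses.
	if timeDiff > h.blockTime && !blockNumIncreased {
		return false
	}
	// Criterion (2): the unsafe head must not lag by more than the configured
	// unsafe interval. This is the check the issue expected to govern failover.
	return unsafeAge <= h.unsafeInterval
}

func main() {
	h := healthChecker{blockTime: 2 * time.Second, unsafeInterval: 60 * time.Second}
	// A 7s derivation pause: no new unsafe block for 7s, well within the 60s
	// tolerance, yet criterion (1) already marks the node unhealthy.
	fmt.Println(h.unsafeHeadHealthy(false, 7*time.Second, 7*time.Second)) // false
}
```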
My current thinking is that there might be two ways to solve this.
Both have their benefits and drawbacks; curious about your thoughts? Also cc @zhwrd. My thought is that we could probably go with #1, with a recommendation that if you want your unsafe_interval to be 60s, you set your health check interval to 60s or something comparable as well.
I can see why it's a bit confusing. Maybe we need a new config like UNSAFE_PROGRESSION_INTERVAL or something that lets you override the time_diff check, since I don't think increasing the healthCheck interval is a great solution (it would mean you are slower to react to op-node/op-geth outages).
The thing is, if you're already allowing the unsafe interval to be 60s, there's no point in reacting faster, right? (There are other safe/peer checks too, but the most important one here is that the unsafe head is progressing within our tolerance.) Also, if we think a lower unsafe interval is important, I'd suggest we remove check criterion #1 altogether; after giving it some thought, I feel it provides little additional benefit compared to the pure unsafe interval check.
Yes, this is my thought as well. In the PR I linked, it seems that check (1) was added as an afterthought in a commit titled "fix monitor bugs". I don't think this particular addition was tested or discussed thoroughly.
Raising the health check interval would mean we don't fail over on other health problems, e.g. if the node is totally unreachable. To be clear: we are not using 60s in production; that was just to ensure the reproducibility of the bug.
Isn't that the same thing as ...? All of that aside, there is a real problem here. The problem is that every time a new batch is submitted, the active sequencer halts unsafe head production in order to perform derivation. If this derivation takes more than the block time (usually 2s), then a new leader is elected. This is pointless, since all the sequencers are deriving at the same time and none of them is any more capable of progressing the head than the others. With check (1) in place there is a new leader election on every batch submission! While I trust op-conductor to do leader failovers well, there is no way this is the intended behavior of the software. There is absolutely a bug here and we should discuss further.
Do you have any data that indicates it's due to batch processing time? From our experience at Base, batch processing is pretty quick (all in-memory operations that don't require reading/writing the trie DB). I suspect the delay is more likely due to your database/disk configuration.
Good question, appreciate you digging into that. Here is the timeline of events:
17:12:37: Batch submitted as usual, containing roughly 5,400 blocks (this is a 1.5 hour max channel duration with a 1 second block time).
17:12:37 to 17:12:44: During these 7 seconds we see 5,400 log lines of "generated attributes in payload queue". This is logged from
TL;DR: the batch takes 7 seconds to derive, which is greater than the block time, so we hit this problem.
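Putting those numbers together as a rough illustration (figures taken from this thread; the check at the end is a simplification, not op-conductor code):

```go
package main

import "fmt"

// Back-of-the-envelope illustration of the timeline above: a 1.5h max channel
// duration at a 1s block time yields ~5,400 blocks per batch, and the observed
// ~7s derivation pause outlasts the block time.
func main() {
	const (
		channelDurationSec = 1.5 * 3600 // seconds of L2 blocks per batch
		l2BlockTimeSec     = 1.0        // seconds per unsafe block on this chain
		derivationPauseSec = 7.0        // observed pause in unsafe head production
	)
	fmt.Printf("blocks per batch: %.0f\n", channelDurationSec/l2BlockTimeSec) // 5400
	if derivationPauseSec > l2BlockTimeSec {
		fmt.Println("derivation outlasts the block time -> check (1) fails -> failover on every batch")
	}
}
```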
This is quite possible. One major difference is that I believe Base is running op-reth? We are running op-geth, which IIRC is less efficient on DB ops. Another difference is the 2s block time vs 1s. However, I suspect the reason you don't see this on Base is that you never submit a batch at the max channel duration, or you have the duration set to a lower value so that this wouldn't be a problem anyway. In the case of this chain there is no actual usage yet, so we set the durations pretty high in order to save on unnecessary L1 posting costs.
Anything against doing this? We currently have to disable conductor on low-throughput chains. You can keep your desired behavior by setting
It's likely you need to tune your L1 cache settings on op-node for your low-throughput chains; see these release notes for details. Regardless, I tend to agree that for performance issues that are block-specific (geth in particular can have long-tail latencies on block building), we don't want to be needlessly changing leaders just because something exceeded the block time, so I'd be in favour of removing the first check and tuning your OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL to whatever you feel is a reasonable tradeoff.
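For example (illustrative values only, not a recommendation from this thread; this assumes the first check has been removed so failover is governed purely by the unsafe interval), a low-throughput chain could keep a tight health-check cadence while tolerating a long unsafe lag:

```
OP_CONDUCTOR_HEALTHCHECK_INTERVAL: "1"           # keep reacting quickly to unreachable nodes
OP_CONDUCTOR_HEALTHCHECK_MIN_PEER_COUNT: "1"
OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL: "300"  # tolerate long gaps between unsafe blocks
```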
Seems like we've got an agreement. @brg8, do you mind creating a PR to remove that check?
Yep I can do that 👍
Hello! I took a stab at this. I tested with an updated mock unit test which covers this. Happy to also test this manually, but it'd save some time if a Docker image for this branch could be published with Optimism's CI/CD. Is that possible to do pre-merge? Thanks!
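For reference, here is a minimal sketch of the kind of assertion such a test can make (illustrative only; the package, helper, and test names below are hypothetical and do not use the real op-conductor test harness): with check (1) removed, a node should be unhealthy only once its unsafe head is older than the configured interval.

```go
package health

import (
	"testing"
	"time"
)

// Hypothetical stand-in for the simplified check discussed above: with
// criterion (1) removed, only the unsafe-interval lag decides health.
func unsafeHeadHealthy(unsafeAge, unsafeInterval time.Duration) bool {
	return unsafeAge <= unsafeInterval
}

func TestUnsafeIntervalGovernsFailover(t *testing.T) {
	const unsafeInterval = 60 * time.Second
	cases := []struct {
		name      string
		unsafeAge time.Duration
		healthy   bool
	}{
		{"7s derivation pause stays healthy", 7 * time.Second, true},
		{"59s lag is still healthy", 59 * time.Second, true},
		{"61s lag is unhealthy", 61 * time.Second, false},
	}
	for _, tc := range cases {
		if got := unsafeHeadHealthy(tc.unsafeAge, unsafeInterval); got != tc.healthy {
			t.Errorf("%s: got healthy=%v, want %v", tc.name, got, tc.healthy)
		}
	}
}
```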
We have released this change; @dpulitano, you should be able to pull the new image now. Let us know how testing goes on your end.
Thanks, I saw! Deployed it for an internal chain and it's looking alright. Will test on some more chains soon. By the way, I noticed there's a breaking change for us in this version where we can no longer specify the private IP address of op-conductor for
Do you know if this was intentional or a regression? I can open another issue if it doesn't seem intentional. The OP Conductor guide still shows using the private IP for
Ah yeah, that was an intentional change in 0.3 I think; will get the docs updated.
Bug Description
The op-conductor health check ignores the OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL configuration option.

Steps to Reproduce
To exacerbate the problem for reproducibility's sake, start a local devnet with a 1 second block time and OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL set to 60 seconds. Give the batcher a max time of 1.5 hours.

In the conductor logs you will find:
And then the conductor will select a new leader. You will notice that op-conductor fails over to new sequencers alarmingly often.
Expected behavior
There should be no sequencer failover.
In the example above op-conductor evaluates that it expected 1 block and got 0, so it should pick a new leader. This ignores the OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL option, which should allow the sequencer to be up to 60 seconds behind.

Environment Information:
Configurations:
OP_CONDUCTOR_HEALTHCHECK_INTERVAL: "1"
OP_CONDUCTOR_HEALTHCHECK_MIN_PEER_COUNT: "1"
OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL: "60"
Logs:
Additional context
This PR introduced the bug in the health check.