-
Notifications
You must be signed in to change notification settings - Fork 697
feat: add health check/warmup to segmentwriter to reduce first-write latency #4453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add health check/warmup to segmentwriter to reduce first-write latency #4453
Conversation
healthCheckCtx, cancel := context.WithTimeout(ctx, i.config.BucketHealthCheckTimeout) | ||
defer cancel() | ||
|
||
err := i.storageBucket.Iter(healthCheckCtx, "", func(string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably quicker to do an Exists()/Get()/Attribures() and expect a not found.
What I am also unsure about if the bucket name is wrong, do we get a "not found" or another error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea here is to try and not depend on implementation specifics and whether the bucket is empty. To my best understanding, bucket not being present should emit a not found
error indeed, while any other scenario would result in successful request.
Speaking frankly, I am also not sure if the error is same, but this approach feels the most robust out of alternatives I've considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reasoning make sense, let's keep it like that.
In case of the bucket list being very long/expensive to gather, we could consider adding a prefix that likely would never exist, then it will always be expecting an empty reply.
Also for the functioning of the segment writer it is critical that we have the write permission, so another alternative, could be testing if an Upload is allowed.
// Perform bucket health check before ring registration to warm up the connection | ||
// and avoid slow first requests affecting p99 latency | ||
// On error, will emit a warning but continue startup | ||
_ = i.performBucketHealthCheck(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the bigger question is how would we know that something is wrong, if we mark ourselves as ready/healthy anyhow. Maybe the right thing here is to exit with an error code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to take a cautious approach here to make sure I don't create an incident accidentally because I didn't consider something. As discussed over call, I'll add a TODO comment and an issue into board, so we can change it into a fatal error once we're sure this doesn't break PROD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, let's add the TODO comment/issue and get this in
Uh oh!
There was an error while loading. Please reload this page.