@oleg-kozlyuk-grafana oleg-kozlyuk-grafana commented Sep 23, 2025

  • Adds a dual-purpose check that:
    • Establishes a connection to the object store before reporting ready, to reduce latency
    • Logs a warning if there is an issue with credentials

@oleg-kozlyuk-grafana oleg-kozlyuk-grafana marked this pull request as ready for review September 23, 2025 12:43
healthCheckCtx, cancel := context.WithTimeout(ctx, i.config.BucketHealthCheckTimeout)
defer cancel()

err := i.storageBucket.Iter(healthCheckCtx, "", func(string) error {
Contributor

It would probably be quicker to do an Exists()/Get()/Attributes() and expect a not found.

What I am also unsure about: if the bucket name is wrong, do we get a "not found" or another error?

Contributor Author

@oleg-kozlyuk-grafana oleg-kozlyuk-grafana Sep 25, 2025

My idea here is to avoid depending on implementation specifics or on whether the bucket is empty. To the best of my understanding, a missing bucket should indeed produce a not-found error, while any other scenario would result in a successful request.

Frankly, I am also not sure whether the error is the same, but this approach feels the most robust of the alternatives I've considered.

Contributor

I think the reasoning makes sense, let's keep it like that.

If the bucket listing turns out to be very long or expensive to gather, we could consider adding a prefix that would likely never exist, so the check always expects an empty reply.

Also, it is critical for the functioning of the segment writer that we have write permission, so another alternative could be testing whether an Upload is allowed.

// Perform bucket health check before ring registration to warm up the connection
// and avoid slow first requests affecting p99 latency
// On error, will emit a warning but continue startup
_ = i.performBucketHealthCheck(ctx)
Contributor

I guess the bigger question is how we would know that something is wrong if we mark ourselves as ready/healthy anyway. Maybe the right thing here is to exit with an error code.

Contributor Author

I'd like to take a cautious approach here to make sure I don't accidentally create an incident because I didn't consider something. As discussed on the call, I'll add a TODO comment and an issue on the board, so we can change it into a fatal error once we're sure this doesn't break PROD.

Contributor

@simonswine simonswine left a comment

LGTM, let's add the TODO comment/issue and get this in

@oleg-kozlyuk-grafana oleg-kozlyuk-grafana merged commit a077234 into grafana:main Oct 9, 2025
20 checks passed