fix(finops-v2): stale prom client cache race + multi-bucket efficiency merge #760
Merged
internal/prometheus/client.go: getPromClient's double-checked locking
caches a stale client when baseURL changes during the lock-release
window. After snapshotting (base, bp) under RLock, a concurrent
markConnected("B") can update c.baseURL before we acquire the write
lock to cache c.prom — but the existing check only verified c.prom
== nil, not that c.baseURL still matched. Result: a client bound to
A gets stored as c.prom while c.baseURL is now B, and the next
request sends to A. Now also verify c.baseURL/c.basePath haven't
changed before caching; if they have, discard our build and return
nil so the next caller re-snapshots.
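The check described above can be sketched as follows. This is a minimal illustration, not the code from internal/prometheus/client.go: the `Client` and `promClient` types are trimmed-down stand-ins, the `afterSnapshot` hook exists only to make the race window reproducible, and `markConnected` clearing the cached client is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// promClient is a stand-in for the real Prometheus client; it only records
// which endpoint it was built for.
type promClient struct{ baseURL, basePath string }

// Client is a hypothetical, trimmed-down version of the real struct.
type Client struct {
	mu       sync.RWMutex
	baseURL  string
	basePath string
	prom     *promClient
}

// markConnected switches the endpoint. Dropping the cached client here is
// an assumption made for this sketch.
func (c *Client) markConnected(baseURL string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.baseURL = baseURL
	c.prom = nil
}

// getPromClient snapshots under RLock, builds outside the lock, and caches
// only if the snapshot still matches live state. afterSnapshot is a
// hypothetical test hook that runs inside the lock-release window.
func (c *Client) getPromClient(afterSnapshot func()) *promClient {
	c.mu.RLock()
	if p := c.prom; p != nil {
		c.mu.RUnlock()
		return p
	}
	base, bp := c.baseURL, c.basePath
	c.mu.RUnlock()

	if afterSnapshot != nil {
		afterSnapshot() // a concurrent markConnected("B") may land here
	}
	built := &promClient{baseURL: base, basePath: bp}

	c.mu.Lock()
	defer c.mu.Unlock()
	if c.prom != nil {
		return c.prom
	}
	if c.baseURL != base || c.basePath != bp {
		// The endpoint drifted: discard the build so the next caller re-snapshots.
		return nil
	}
	c.prom = built
	return built
}

func main() {
	c := &Client{baseURL: "http://a"}
	got := c.getPromClient(func() { c.markConnected("http://b") })
	fmt.Println("stale build discarded:", got == nil)
	fmt.Println("next call binds to:", c.getPromClient(nil).baseURL)
}
```

Without the `c.baseURL != base` comparison, the first call would cache a client bound to `http://a` while `c.baseURL` already reads `http://b`.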
pkg/opencost/compute.go: ComputeCostSummary's REST merge across
multi-window buckets copied the first bucket's allocation via
cp := *a (preserving its TotalEfficiency) and added subsequent
bucket costs but never updated TotalEfficiency. Multi-bucket
responses (rare in radar's current usage — Step isn't set on summary
— but possible for library consumers) reported efficiency from only
the first bucket. Now weight TotalEfficiency by per-bucket alloc
cost during the merge.
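A cost-weighted merge of this kind might look like the sketch below. The `Allocation` struct, its `TotalCost` field, and the `mergeBuckets` helper are hypothetical names chosen for illustration; the actual types and merge loop in pkg/opencost/compute.go may differ.

```go
package main

import "fmt"

// Allocation is a hypothetical, reduced form of an allocation entry.
type Allocation struct {
	Name            string
	TotalCost       float64
	TotalEfficiency float64 // fraction in [0, 1]
}

// mergeBuckets folds per-window buckets into one allocation per name.
// Costs are summed; efficiency becomes the cost-weighted mean of the
// per-bucket efficiencies instead of the first bucket's value.
func mergeBuckets(buckets []map[string]*Allocation) map[string]*Allocation {
	merged := make(map[string]*Allocation)
	for _, bucket := range buckets {
		for name, a := range bucket {
			if a == nil {
				continue
			}
			m, ok := merged[name]
			if !ok {
				cp := *a // first bucket: copy, keeping its efficiency
				merged[name] = &cp
				continue
			}
			total := m.TotalCost + a.TotalCost
			if total > 0 {
				m.TotalEfficiency = (m.TotalEfficiency*m.TotalCost +
					a.TotalEfficiency*a.TotalCost) / total
			}
			m.TotalCost = total
		}
	}
	return merged
}

func main() {
	buckets := []map[string]*Allocation{
		{"ns-a": {Name: "ns-a", TotalCost: 10, TotalEfficiency: 0.5}},
		{"ns-a": {Name: "ns-a", TotalCost: 30, TotalEfficiency: 0.75}},
	}
	m := mergeBuckets(buckets)["ns-a"]
	// (0.5*10 + 0.75*30) / 40 = 0.6875, not the first bucket's 0.5.
	fmt.Printf("cost=%.2f efficiency=%.4f\n", m.TotalCost, m.TotalEfficiency)
}
```

When the combined cost is zero the sketch leaves the existing efficiency untouched to avoid dividing by zero; whether the real fix handles that case the same way is not stated in the commit message.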
Both surfaced by cursor bugbot on commit 85012c6.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 2ad8045.
The previous fix for the stale-client cache returned nil when baseURL/basePath drifted between snapshot and write-lock. Cursor pointed out the downside: nil propagates to EnsureConnected (triggers a full discover() even though baseURL may still be valid) and to Prom() callers (handlers surface ReasonNoPrometheus without a retry on the same request).
Simpler shape: skip the read-snapshot-then-build dance. After the fast path misses, take the write lock and build from live state. The transport constructor is just struct-field assignments so holding the write lock across it is cheap, and the build always sees a consistent (baseURL, basePath, httpClient, headers) tuple.
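The simpler shape could be sketched like this. As before, `Client` and `promClient` are hypothetical stand-ins rather than the real types, the httpClient and headers fields are omitted, and `markConnected` clearing the cache is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// promClient is a stand-in for the real client; construction is plain
// struct-field assignment with no I/O, as the commit message describes.
type promClient struct{ baseURL, basePath string }

// Client is a hypothetical, trimmed-down version of the real struct.
type Client struct {
	mu       sync.RWMutex
	baseURL  string
	basePath string
	prom     *promClient
}

// markConnected switches the endpoint; clearing the cache is assumed here.
func (c *Client) markConnected(baseURL string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.baseURL = baseURL
	c.prom = nil
}

// getPromClient tries the RLock fast path, then builds from live state
// while holding the write lock, so the cached client always matches the
// baseURL/basePath it was built from.
func (c *Client) getPromClient() *promClient {
	c.mu.RLock()
	p := c.prom
	c.mu.RUnlock()
	if p != nil {
		return p
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	if c.prom == nil { // another goroutine may have built it already
		c.prom = &promClient{baseURL: c.baseURL, basePath: c.basePath}
	}
	return c.prom
}

func main() {
	c := &Client{baseURL: "http://a"}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.getPromClient()
		}()
	}
	wg.Wait()

	c.markConnected("http://b")
	fmt.Println("bound to:", c.getPromClient().baseURL)
}
```

Because nothing is built outside the lock, there is no window in which the endpoint can change between snapshot and cache-set, and callers never receive nil because of a drift.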

Two follow-up fixes that landed too late to make it into #492 (squash-merged just before they were pushed). Both surfaced by cursor bugbot review on the original branch.
internal/prometheus/client.go: race in getPromClient cache
getPromClient originally snapshotted baseURL/basePath under RLock, built the transport outside the lock, then took the write lock to cache. A concurrent markConnected("B") could race the cache-set and leave a client bound to A as c.prom while c.baseURL was now B.
Simpler shape: skip the read-snapshot-then-build dance entirely. After the fast-path RLock miss, take the write lock and build from live state. The transport constructor is just struct-field assignments (no I/O) so holding the write lock across it is cheap, and the build always sees a consistent (baseURL, basePath, httpClient, headers) tuple.
pkg/opencost/compute.go: multi-bucket TotalEfficiency merge
ComputeCostSummary's merge across multi-window /allocation buckets copied the first bucket's allocation (cp := *a), which preserved its TotalEfficiency, then accumulated subsequent buckets' costs but never updated efficiency. Multi-bucket responses reported efficiency from only the first bucket.
Radar's current usage doesn't set Step on summary calls, so this never triggers in-process, but the path is published library API and external consumers may use Step. Fix: weight TotalEfficiency by per-bucket allocation cost during the merge.
Test plan
go build ./... clean
go test ./internal/prometheus/... and go test ./pkg/opencost/... green