Skip to content

fix(finops-v2): stale prom client cache race + multi-bucket efficiency merge#760

Merged
nadaverell merged 2 commits into
mainfrom
fix/finops-v2-followup
May 23, 2026
Merged

fix(finops-v2): stale prom client cache race + multi-bucket efficiency merge#760
nadaverell merged 2 commits into
mainfrom
fix/finops-v2-followup

Conversation

@nadaverell
Copy link
Copy Markdown
Contributor

@nadaverell nadaverell commented May 23, 2026

Two follow-up fixes that landed too late to make it into #492 (squash-merged just before they were pushed). Both surfaced by cursor bugbot review on the original branch.

internal/prometheus/client.go — race in getPromClient cache

getPromClient originally snapshotted baseURL/basePath under RLock, built the transport outside the lock, then took the write lock to cache. A concurrent markConnected("B") could race the cache-set and leave a client bound to A as c.prom while c.baseURL was now B.

Simpler shape: skip the read-snapshot-then-build dance entirely. After the fast-path RLock miss, take the write lock and build from live state. The transport constructor is just struct-field assignments (no I/O) so holding the write lock across it is cheap, and the build always sees a consistent (baseURL, basePath, httpClient, headers) tuple.

pkg/opencost/compute.go — multi-bucket TotalEfficiency merge

ComputeCostSummary's merge across multi-window /allocation buckets copied the first bucket's allocation (cp := *a) which preserved its TotalEfficiency, then accumulated subsequent buckets' costs but never updated efficiency. Multi-bucket responses reported efficiency from only the first bucket.

Radar's current usage doesn't set Step on summary calls so this never triggers in-process, but the path is published library API and external consumers may use Step. Fix: weight TotalEfficiency by per-bucket allocation cost during the merge.

Test plan

  • go build ./... clean
  • go test ./internal/prometheus/... and go test ./pkg/opencost/... green

internal/prometheus/client.go: getPromClient's double-checked locking
caches a stale client when baseURL changes during the lock-release
window. After snapshotting (base, bp) under RLock, a concurrent
markConnected("B") can update c.baseURL before we acquire the write
lock to cache c.prom — but the existing check only verified c.prom
== nil, not that c.baseURL still matched. Result: a client bound to
A gets stored as c.prom while c.baseURL is now B, and the next
request sends to A. Now also verify c.baseURL/c.basePath haven't
changed before caching; if they have, discard our build and return
nil so the next caller re-snapshots.

pkg/opencost/compute.go: ComputeCostSummary's REST merge across
multi-window buckets copied the first bucket's allocation via
cp := *a (preserving its TotalEfficiency) and added subsequent
bucket costs but never updated TotalEfficiency. Multi-bucket
responses (rare in radar's current usage — Step isn't set on summary
— but possible for library consumers) reported efficiency from only
the first bucket. Now weight TotalEfficiency by per-bucket alloc
cost during the merge.

Both surfaced by cursor bugbot on commit 85012c6.
@nadaverell nadaverell requested a review from hisco as a code owner May 23, 2026 23:04
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2ad8045. Configure here.

Comment thread internal/prometheus/client.go
The previous fix for the stale-client cache returned nil when
baseURL/basePath drifted between snapshot and write-lock. Cursor
pointed out the downside: nil propagates to EnsureConnected (triggers
a full discover() even though baseURL may still be valid) and to
Prom() callers (handlers surface ReasonNoPrometheus without a retry
on the same request).

Simpler shape: skip the read-snapshot-then-build dance. After the
fast path misses, take the write lock and build from live state. The
transport constructor is just struct-field assignments so holding
the write lock across it is cheap, and the build always sees a
consistent (baseURL, basePath, httpClient, headers) tuple.
@nadaverell nadaverell merged commit e4e13e6 into main May 23, 2026
8 checks passed
@nadaverell nadaverell deleted the fix/finops-v2-followup branch May 23, 2026 23:13
@nadaverell nadaverell restored the fix/finops-v2-followup branch May 23, 2026 23:17
@nadaverell nadaverell deleted the fix/finops-v2-followup branch May 23, 2026 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant