fix(finops-v2): stale prom client cache race + multi-bucket efficiency merge by nadaverell · Pull Request #760 · skyhook-io/radar

nadaverell · 2026-05-23T23:04:58Z

Two follow-up fixes that landed too late to make it into #492 (squash-merged just before they were pushed). Both surfaced by cursor bugbot review on the original branch.

`internal/prometheus/client.go` — race in getPromClient cache

getPromClient originally snapshotted baseURL/basePath under RLock, built the transport outside the lock, then took the write lock to cache. A concurrent markConnected("B") could race the cache-set and leave a client bound to A as c.prom while c.baseURL was now B.

Simpler shape: skip the read-snapshot-then-build dance entirely. After the fast-path RLock miss, take the write lock and build from live state. The transport constructor is just struct-field assignments (no I/O) so holding the write lock across it is cheap, and the build always sees a consistent (baseURL, basePath, httpClient, headers) tuple.

`pkg/opencost/compute.go` — multi-bucket TotalEfficiency merge

ComputeCostSummary's merge across multi-window /allocation buckets copied the first bucket's allocation (cp := *a) which preserved its TotalEfficiency, then accumulated subsequent buckets' costs but never updated efficiency. Multi-bucket responses reported efficiency from only the first bucket.

Radar's current usage doesn't set Step on summary calls so this never triggers in-process, but the path is published library API and external consumers may use Step. Fix: weight TotalEfficiency by per-bucket allocation cost during the merge.

Test plan

go build ./... clean
go test ./internal/prometheus/... and go test ./pkg/opencost/... green

internal/prometheus/client.go: getPromClient's double-checked locking caches a stale client when baseURL changes during the lock-release window. After snapshotting (base, bp) under RLock, a concurrent markConnected("B") can update c.baseURL before we acquire the write lock to cache c.prom — but the existing check only verified c.prom == nil, not that c.baseURL still matched. Result: a client bound to A gets stored as c.prom while c.baseURL is now B, and the next request sends to A. Now also verify c.baseURL/c.basePath haven't changed before caching; if they have, discard our build and return nil so the next caller re-snapshots. pkg/opencost/compute.go: ComputeCostSummary's REST merge across multi-window buckets copied the first bucket's allocation via cp := *a (preserving its TotalEfficiency) and added subsequent bucket costs but never updated TotalEfficiency. Multi-bucket responses (rare in radar's current usage — Step isn't set on summary — but possible for library consumers) reported efficiency from only the first bucket. Now weight TotalEfficiency by per-bucket alloc cost during the merge. Both surfaced by cursor bugbot on commit 85012c6.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 2ad8045. Configure here.}

The previous fix for the stale-client cache returned nil when baseURL/basePath drifted between snapshot and write-lock. Cursor pointed out the downside: nil propagates to EnsureConnected (triggers a full discover() even though baseURL may still be valid) and to Prom() callers (handlers surface ReasonNoPrometheus without a retry on the same request). Simpler shape: skip the read-snapshot-then-build dance. After the fast path misses, take the write lock and build from live state. The transport constructor is just struct-field assignments so holding the write lock across it is cheap, and the build always sees a consistent (baseURL, basePath, httpClient, headers) tuple.

nadaverell requested a review from hisco as a code owner May 23, 2026 23:04

cursor Bot reviewed May 23, 2026

View reviewed changes

Comment thread internal/prometheus/client.go

nadaverell merged commit e4e13e6 into main May 23, 2026
8 checks passed

nadaverell deleted the fix/finops-v2-followup branch May 23, 2026 23:13

nadaverell restored the fix/finops-v2-followup branch May 23, 2026 23:17

nadaverell deleted the fix/finops-v2-followup branch May 23, 2026 23:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(finops-v2): stale prom client cache race + multi-bucket efficiency merge#760

fix(finops-v2): stale prom client cache race + multi-bucket efficiency merge#760
nadaverell merged 2 commits into
mainfrom
fix/finops-v2-followup

nadaverell commented May 23, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nadaverell commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

internal/prometheus/client.go — race in getPromClient cache

pkg/opencost/compute.go — multi-bucket TotalEfficiency merge

Test plan

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nadaverell commented May 23, 2026 •

edited

Loading

`internal/prometheus/client.go` — race in getPromClient cache

`pkg/opencost/compute.go` — multi-bucket TotalEfficiency merge