Stop iterating the live serverAds cache under its own lock#3513
Open
jhiemstrawisc wants to merge 2 commits into
Open
Stop iterating the live serverAds cache under its own lock#3513jhiemstrawisc wants to merge 2 commits into
jhiemstrawisc wants to merge 2 commits into
Conversation
Member
Author
d64ba57 to
6f7e9ad
Compare
h2zh
requested changes
Jun 11, 2026
h2zh
left a comment
Contributor
There was a problem hiding this comment.
This is a clean fix. It aligns with what I found in the ITB Director core dump (lock-coupling defect of serverAds) and furthur pinpoint the root cause.
Good to merge after the failed tests are resolved. We discussed this offline, one is caused by Harbor outage, but TestUpdateConfig_InvalidatesTokenCache seems to fail because of the code change in this PR. I try to run tests in /local_cache/cache_authz_test.go locally. This test succeeds when I run it in the main branch, but fails in this branch.
6f7e9ad to
96cfc97
Compare
Member
Author
|
FYI, I rolled back the ttlcache library bump -- it caused failing tests, and after I started pulling on the thread I decided not to keep pulling. The fix here avoids the library's buggy |
getAdsForPath walked serverAds with ttlcache's Range on every redirect. Range releases and re-acquires the cache's internal read lock around each element and leaks that lock if an entry is evicted or deleted mid-walk -- routine under registration churn plus TTL expiry. One leaked reader is fatal: the write-preferring RWMutex then blocks every reader and writer on the cache. And because updateDowntimeFromRegistry held the global filteredServersMutex while reading serverAds, that stall propagated into the registration path and piled up goroutines until the director was OOM-killed. Fix the class of bug, not just the one call site: - Make getServerAdsSnapshot the single idiom for walking the cache, backed by the atomically-locked Items(). Callers iterate their own copy with no cache lock held, so a slow or misbehaving cache can't wedge them. Both the helper and the serverAds declaration document why Range must never be used, so a future performance tweak doesn't reintroduce it. Items() stays only where the keys or Item wrappers are actually needed. - Never hold filteredServersMutex across a serverAds access -- the two subsystems have independent locks and nesting them is an ABBA hazard. updateDowntimeFromRegistry now snapshots before locking. The underlying Range lock-leak is an upstream ttlcache bug worth reporting separately; this change makes the director robust regardless. Rebase
96cfc97 to
eca4f46
Compare
This is unrelated to the overall PR, but reflects changes in upstream build targets that are preventing container builds
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
getAdsForPath walked serverAds with ttlcache's Range on every redirect. Range releases and re-acquires the cache's internal read lock around each element and leaks that lock if an entry is evicted or deleted mid-walk -- routine under registration churn plus TTL expiry. One leaked reader is fatal: the write-preferring RWMutex then blocks every reader and writer on the cache. And because updateDowntimeFromRegistry held the global filteredServersMutex while reading serverAds, that stall propagated into the registration path and piled up goroutines until the director was OOM-killed.
Fix the class of bug, not just the one call site:
Make getServerAdsSnapshot the single idiom for walking the cache, backed by the atomically-locked Items(). Callers iterate their own copy with no cache lock held, so a slow or misbehaving cache can't wedge them. Both the helper and the serverAds declaration document why Range must never be used, so a future performance tweak doesn't reintroduce it. Items() stays only where the keys or Item wrappers are actually needed.
Never hold filteredServersMutex across a serverAds access -- the two subsystems have independent locks and nesting them is an ABBA hazard. updateDowntimeFromRegistry now snapshots before locking.
The underlying Range lock-leak is an upstream ttlcache bug worth reporting separately; this change makes the director robust regardless.