Stop iterating the live serverAds cache under its own lock#3513

Open
jhiemstrawisc wants to merge 2 commits into PelicanPlatform:main from jhiemstrawisc:fix-serverad-ranging

Conversation

@jhiemstrawisc

Member

getAdsForPath walked serverAds with ttlcache's Range on every redirect. Range releases and re-acquires the cache's internal read lock around each element and leaks that lock if an entry is evicted or deleted mid-walk -- routine under registration churn plus TTL expiry. One leaked reader is fatal: the write-preferring RWMutex then blocks every reader and writer on the cache. And because updateDowntimeFromRegistry held the global filteredServersMutex while reading serverAds, that stall propagated into the registration path and piled up goroutines until the director was OOM-killed.

Fix the class of bug, not just the one call site:

  • Make getServerAdsSnapshot the single idiom for walking the cache, backed by the atomically-locked Items(). Callers iterate their own copy with no cache lock held, so a slow or misbehaving cache can't wedge them. Both the helper and the serverAds declaration document why Range must never be used, so a future performance tweak doesn't reintroduce it. Items() stays only where the keys or Item wrappers are actually needed.

  • Never hold filteredServersMutex across a serverAds access -- the two subsystems have independent locks and nesting them is an ABBA hazard. updateDowntimeFromRegistry now snapshots before locking.
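
A minimal sketch of both rules, using only the standard library. The types `adCache` and `downtimeState` and the function `updateDowntime` are hypothetical stand-ins, not the director's actual code: `Snapshot` plays the role of getServerAdsSnapshot, and `downtimeState.mu` plays the role of filteredServersMutex.

```go
package main

import (
	"fmt"
	"sync"
)

// adCache is a hypothetical stand-in for the serverAds cache; it is not the
// ttlcache type the director actually uses.
type adCache struct {
	mu  sync.RWMutex
	ads map[string]string // server URL -> server name
}

// Set stores or replaces one entry under the write lock.
func (c *adCache) Set(url, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ads[url] = name
}

// Snapshot copies every entry under a single read-lock acquisition and
// returns the copy, so callers iterate with no cache lock held.
func (c *adCache) Snapshot() map[string]string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string]string, len(c.ads))
	for url, name := range c.ads {
		out[url] = name
	}
	return out
}

// downtimeState is a hypothetical stand-in for the state guarded by
// filteredServersMutex.
type downtimeState struct {
	mu       sync.Mutex
	filtered map[string]bool
}

// updateDowntime snapshots the cache before taking its own lock, so the two
// locks are never held at the same time.
func updateDowntime(c *adCache, d *downtimeState, down map[string]bool) {
	ads := c.Snapshot() // cache lock is acquired and released inside Snapshot

	d.mu.Lock()
	defer d.mu.Unlock()
	for _, name := range ads {
		if down[name] {
			d.filtered[name] = true
		}
	}
}

func main() {
	c := &adCache{ads: map[string]string{}}
	c.Set("https://origin-a.example", "origin-a")
	c.Set("https://cache-b.example", "cache-b")

	// Writing to the cache while walking the snapshot is safe, because the
	// loop holds no cache lock.
	for url, name := range c.Snapshot() {
		c.Set(url, name+"-seen")
	}

	d := &downtimeState{filtered: map[string]bool{}}
	updateDowntime(c, d, map[string]bool{"cache-b-seen": true})
	fmt.Println(d.filtered)
}
```

The cost is one copy per walk, and the copy may be slightly stale by the time it is read. In exchange, no caller can be wedged by the cache's lock while it holds a lock of its own.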

The underlying Range lock-leak is an upstream ttlcache bug worth reporting separately; this change makes the director robust regardless.

@jhiemstrawisc requested a review from h2zh on June 11, 2026 16:20
@jhiemstrawisc added the bug and critical labels on Jun 11, 2026
@jhiemstrawisc

Member Author

@h2zh and @turetske once this is approved/merged, it should be backported into 7.25 and 7.26. We suspect this is the cause of the goroutine buildups in the 7.25 ITB/OSDF Directors.

@jhiemstrawisc force-pushed the fix-serverad-ranging branch from d64ba57 to 6f7e9ad on June 11, 2026 16:34
@h2zh added the director and create-patch labels on Jun 11, 2026
@h2zh self-assigned this on Jun 11, 2026

@h2zh left a comment

Contributor


This is a clean fix. It aligns with what I found in the ITB Director core dump (a lock-coupling defect in serverAds) and further pinpoints the root cause.

Good to merge once the failing tests are resolved. As we discussed offline, one failure is caused by the Harbor outage, but TestUpdateConfig_InvalidatesTokenCache seems to fail because of the code change in this PR. I ran the tests in /local_cache/cache_authz_test.go locally: this test succeeds on the main branch but fails on this branch.

@h2zh assigned jhiemstrawisc and unassigned h2zh on Jun 11, 2026
@jhiemstrawisc force-pushed the fix-serverad-ranging branch from 6f7e9ad to 96cfc97 on June 11, 2026 19:27
@jhiemstrawisc

Member Author

FYI, I rolled back the ttlcache library bump -- it caused failing tests, and after I started pulling on the thread I decided not to keep pulling. The fix here avoids the library's buggy Range() function.

Rebase
@jhiemstrawisc force-pushed the fix-serverad-ranging branch from 96cfc97 to eca4f46 on June 11, 2026 20:17
This is unrelated to the overall PR, but reflects changes in upstream build targets that are preventing container builds
@jhiemstrawisc requested a review from h2zh on June 11, 2026 21:09