Summary
Trigger: One or more origins or caches becoming slow enough that their stat requests consistently reach Director_StatTimeout (default: 2 s).
The director can enter a state where:
- Clients time out with net/http: timeout awaiting response headers on redirect requests.
- GET /api/v1.0/director/directors continues to return HTTP 200 throughout.
- The goroutine count trends upward from the first client error until the process is restarted.
- A restart clears the accumulated goroutines and temporarily restores throughput, but accumulation resumes once the director is back under load.
The root cause is that generateAvailabilityMaps in director/stat.go passes context.Background() into the stat path, causing TryGoUntil to block the Gin request handler goroutine indefinitely when a per-server stat errgroup is saturated. Under sustained load, goroutines accumulate until the process is restarted.
Root cause
The blocking call
generateAvailabilityMaps in director/stat.go issues the stat query with a non-cancellable context:
// director/stat.go
qr := q.Query(context.Background(), reqPath, ...) // ← should be the request context
That context propagates through queryServersForObject to each per-server call:
statUtil.Errgroup.TryGoUntil(ctx, lookupFunc) // ctx = context.Background()
TryGoUntil in utils/errgroup.go blocks until a semaphore slot is free or the context is cancelled:
select {
case g.sem <- token{}:
    // acquired a slot
case <-ctx.Done():
    return false
}
context.Background().Done() returns a nil channel. In a Go select, a nil channel case is never selected. So with context.Background(), a goroutine waiting here for a slot can never exit early—not even after the client has disconnected.
How the semaphore saturates
Each origin/cache has its own statUtil whose errgroup is capped at Director_StatConcurrencyLimit concurrent goroutines (default: 100). Against an unresponsive server, each goroutine holds its slot for the full Director_StatTimeout (default: 2 s), giving a maximum service rate of 100 / 2 s = 50 req/s for that server. When the redirect request rate to namespaces served by the slow server exceeds that rate, new Gin goroutines queue up in TryGoUntil and cannot exit. The queue grows faster than it drains, and the goroutine count rises without bound.
Because goroutines are queued waiting for a slot—not occupying one—each one sits in TryGoUntil longer than the 2 s per-stat timeout. Clients therefore hit their own ResponseHeaderTimeout (typically 10 s) waiting for the director to respond, even though the per-stat timeout is much shorter.
Why /api/v1.0/director/directors keeps returning 200
listDirectors reads directly from the directorAds TTL cache in memory. It has no involvement with the stat path or the stat errgroup, so it remains responsive throughout.
Why a restart is only a temporary reprieve
Restarting terminates all goroutines queued in TryGoUntil, resets every errgroup semaphore, and discards pending stat work, immediately restoring normal throughput. However, none of the root conditions change: the bug is still in the code, the unresponsive server is still unresponsive, and clients keep arriving. Goroutine accumulation resumes as soon as the director is back under load.
Fix
Pass a request-scoped context into q.Query() in generateAvailabilityMaps:
// director/stat.go
qr := q.Query(ctx.Request.Context(), reqPath, ...) // was context.Background()
With a cancellable context, goroutines waiting in TryGoUntil can exit via ctx.Done() when the client disconnects or a deadline fires, rather than waiting indefinitely. Active stat HEAD requests will also cancel early via the propagated context. This breaks the feedback loop.
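The effect of the fix on queued goroutines can be sketched as follows. This is an illustration rather than the director's code: drainOnCancel is a hypothetical helper, and for brevity all waiters share one context, whereas in the director each request would carry its own.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

type token struct{}

// drainOnCancel parks n goroutines on a full semaphore, cancels their context
// (standing in for a client disconnect), and reports how many of them gave up
// via ctx.Done().
func drainOnCancel(n int) int64 {
	sem := make(chan token, 1)
	sem <- token{} // the only slot is held by a stat that never finishes

	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	var exited atomic.Int64

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			select {
			case sem <- token{}:
				// a real implementation would run the stat here
			case <-ctx.Done():
				exited.Add(1)
			}
		}()
	}

	cancel()  // with context.Background() there would be nothing to cancel
	wg.Wait() // returns only because every waiter observed ctx.Done()
	return exited.Load()
}

func main() {
	fmt.Println(drainOnCancel(1000)) // 1000
}
```

With context.Background() in place of ctx, wg.Wait() would never return, which is the accumulation described in this report.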
As a complementary hardening measure, add a hard deadline to the redirect handler so that total redirect latency is bounded even under pathological queuing pressure.
(Brian A: With a tip of the hat to Copilot.)