Director: goroutine accumulation and client timeouts due to context.Background() in stat path #3495

@brianaydemir

Summary

Trigger: One or more origins or caches becoming slow enough that their stat requests consistently reach Director_StatTimeout (default: 2 s).

The director can enter a state where:

  • Clients time out with net/http: timeout awaiting response headers on redirect requests.
  • GET /api/v1.0/director/directors continues to return HTTP 200 throughout.
  • The goroutine count trends upward from the first client error until the process is restarted.
  • A restart clears the accumulated goroutines and temporarily restores throughput, but accumulation resumes once the director is back under load.

The root cause is that generateAvailabilityMaps in director/stat.go passes context.Background() into the stat path, causing TryGoUntil to block the Gin request handler goroutine indefinitely when a per-server stat errgroup is saturated. Under sustained load, goroutines accumulate until the process is restarted.

Root cause

The blocking call

generateAvailabilityMaps in director/stat.go issues the stat query with a non-cancellable context:

// director/stat.go
qr := q.Query(context.Background(), reqPath, ...)  // ← should be the request context

That context propagates through queryServersForObject to each per-server call:

statUtil.Errgroup.TryGoUntil(ctx, lookupFunc)  // ctx = context.Background()

TryGoUntil in utils/errgroup.go blocks until a semaphore slot is free or the context is cancelled:

select {
case g.sem <- token{}:
    // acquired a slot
case <-ctx.Done():
    return false
}

context.Background().Done() returns a nil channel. In a Go select, a nil channel case is never selected. So with context.Background(), a goroutine waiting here for a slot can never exit early—not even after the client has disconnected.

How the semaphore saturates

Each origin/cache has its own statUtil whose errgroup is capped at Director_StatConcurrencyLimit concurrent goroutines (default: 100). Against an unresponsive server, each goroutine holds its slot for the full Director_StatTimeout (default: 2 s), giving a maximum service rate of 100 / 2 s = 50 req/s for that server. When the redirect request rate to namespaces served by the slow server exceeds that rate, new Gin goroutines queue up in TryGoUntil and cannot exit. The queue grows faster than it drains, and the goroutine count rises without bound.

Queued goroutines are waiting for a slot, not occupying one, so the 2 s per-stat timeout does not bound their wait: it applies to the stat call itself, not to the time spent in TryGoUntil. Each queued goroutine can therefore sit there far longer than 2 s. Clients then hit their own ResponseHeaderTimeout (typically 10 s) waiting for the director to respond, even though the per-stat timeout is much shorter.

Why /api/v1.0/director/directors keeps returning 200

listDirectors reads directly from the directorAds TTL cache in memory. It has no involvement with the stat path or the stat errgroup, so it remains responsive throughout.

Why a restart is only a temporary reprieve

Restarting terminates all goroutines queued in TryGoUntil, resets every errgroup semaphore, and discards pending stat work, immediately restoring normal throughput. However, none of the root conditions change: the bug is still in the code, the unresponsive server is still unresponsive, and clients keep arriving. Goroutine accumulation resumes as soon as the director is back under load.

Fix

Pass a request-scoped context into q.Query() in generateAvailabilityMaps:

// director/stat.go
qr := q.Query(ctx.Request.Context(), reqPath, ...)  // was context.Background()

With a cancellable context, goroutines waiting in TryGoUntil can exit via ctx.Done() when the client disconnects or a deadline fires, rather than waiting indefinitely. Active stat HEAD requests will also cancel early via the propagated context. This breaks the feedback loop.

As a complementary hardening measure, add a hard deadline to the redirect handler so that total redirect latency is bounded even under pathological queuing pressure.


(Brian A: With a tip of the hat to Copilot.)

Labels: bug (Something isn't working), director (Issue relating to the director component)
