Director: goroutine accumulation and client timeouts due to `context.Background()` in stat path

## Summary

**Trigger**: One or more origins or caches becoming slow enough that their stat requests consistently reach `Director_StatTimeout` (default: 2 s).

The director can enter a state where:

- Clients time out with `net/http: timeout awaiting response headers` on redirect requests.
- `GET /api/v1.0/director/directors` continues to return HTTP 200 throughout.
- The goroutine count trends upward from the first client error until the process is restarted.
- A restart clears the accumulated goroutines and temporarily restores throughput, but accumulation resumes once the director is back under load.

The root cause is that `generateAvailabilityMaps` in `director/stat.go` passes `context.Background()` into the stat path, causing `TryGoUntil` to block the Gin request handler goroutine indefinitely when a per-server stat errgroup is saturated. Under sustained load, goroutines accumulate until the process is restarted.

## Root cause

### The blocking call

`generateAvailabilityMaps` in `director/stat.go` issues the stat query with a non-cancellable context:

```go
// director/stat.go
qr := q.Query(context.Background(), reqPath, ...)  // ← should be the request context
```

That context propagates through `queryServersForObject` to each per-server call:

```go
statUtil.Errgroup.TryGoUntil(ctx, lookupFunc)  // ctx = context.Background()
```

`TryGoUntil` in `utils/errgroup.go` blocks until a semaphore slot is free or the context is cancelled:

```go
select {
case g.sem <- token{}:
    // acquired a slot
case <-ctx.Done():
    return false
}
```

`context.Background().Done()` returns a nil channel, and a receive from a nil channel blocks forever, so in a Go `select` that case is never chosen. With `context.Background()`, a goroutine waiting here for a slot therefore cannot exit early, not even after the client has disconnected.

### How the semaphore saturates

Each origin/cache has its own `statUtil` whose errgroup is capped at `Director_StatConcurrencyLimit` concurrent goroutines (default: 100). Against an unresponsive server, each goroutine holds its slot for the full `Director_StatTimeout` (default: 2 s), giving a maximum service rate of **100 / 2 s = 50 req/s** for that server. When the redirect request rate to namespaces served by the slow server exceeds that rate, new Gin goroutines queue up in `TryGoUntil` and cannot exit. The queue grows faster than it drains, and the goroutine count rises without bound.

Because these goroutines are queued waiting for a slot rather than occupying one, the 2 s per-stat timeout does not bound their wait, and each can sit in `TryGoUntil` for much longer than 2 s. Clients therefore hit their own `ResponseHeaderTimeout` (typically 10 s) waiting for the director to respond, even though the per-stat timeout is much shorter.

### Why `/api/v1.0/director/directors` keeps returning 200

`listDirectors` reads directly from the `directorAds` TTL cache in memory. It has no involvement with the stat path or the stat errgroup, so it remains responsive throughout.

### Why a restart is only a temporary reprieve

Restarting terminates all goroutines queued in `TryGoUntil`, resets every errgroup semaphore, and discards pending stat work, immediately restoring normal throughput. However, none of the root conditions change: the bug is still in the code, the unresponsive server is still unresponsive, and clients keep arriving. Goroutine accumulation resumes as soon as the director is back under load.

## Fix

Pass a request-scoped context into `q.Query()` in `generateAvailabilityMaps`:

```go
// director/stat.go
qr := q.Query(ctx.Request.Context(), reqPath, ...)  // was context.Background()
```

With a cancellable context, goroutines waiting in `TryGoUntil` can exit via `ctx.Done()` when the client disconnects or a deadline fires, rather than waiting indefinitely. Active stat HEAD requests will also cancel early via the propagated context. Queued handlers can then leave the queue, so it drains instead of growing without bound.

As a complementary hardening measure, add a hard deadline to the redirect handler so that total redirect latency is bounded even under pathological queuing pressure.

---

_([Brian A](https://github.com/brianaydemir): With a tip of the hat to Copilot.)_

