Skip to content

GPU resource-control framework: pressure governor + workload placement#20

Merged
sunrunnerfire merged 4 commits into
mainfrom
feat/gpu-pressure-governor
Jun 8, 2026
Merged

GPU resource-control framework: pressure governor + workload placement#20
sunrunnerfire merged 4 commits into
mainfrom
feat/gpu-pressure-governor

Conversation

@sunrunnerfire

Copy link
Copy Markdown
Contributor

Adds a two-part resource-control framework so sustained inference on a single-GPU Mac can't starve the display compositor (WindowServer) into a kernel-watchdog panic, and so each workload runs on the device that actually serves it best.

Placement (where work runs) — internal/placement

Per-kind device policy gpu|cpu|auto, hardware-adaptive defaults, env overrides (QUENCHFORGE_PLACE_{CHAT,EMBED,CODE_EMBED,RERANK}).

  • On AMD-discrete (Vega II), chat and rerank default to CPU — measured single-request latency is far better on CPU than the AMD Metal path (chat ~4-5s vs ~27s; corroborated by bench-llama-sustained-load p50=29.4s), and it removes them from compositor GPU contention.
  • embed/code-embed default to GPU (the v0.8.0 batched-throughput win), and can run auto: the gateway routes each request by input count — single→CPU (latency), batched→GPU (throughput) — via a dual GPU+CPU instance.

Governor (how much GPU) — internal/pressure + internal/scheduler + gateway

Adaptive GPU admission: concurrency cap + duty-cycle idle gaps (the key lever — concurrency capping alone does not prevent starvation; sustained gapless command buffers do, at any concurrency). Driven by a display-power + memory-pressure sensor (full throughput when headless/asleep; reserve compositor headroom when a display is driven). Env: QUENCHFORGE_GOVERNOR, _GPU_CONCURRENCY_{MAX,DISPLAY_ACTIVE}, _GPU_DUTY_DISPLAY_ACTIVE, _GOVERNOR_MAX_COOLDOWN_MS.

Safety / compatibility

  • Dormant by default: no auto kind → no second instance, single-upstream path unchanged. Governor is a no-op when headless.
  • Single GPU-admission chokepoint (withGPUAdmission); CPU-placed kinds skip admission entirely.
  • All unit tests pass; gofmt + vet clean. Notes: Device Utilization % from ioreg is unreliable on this AMD card — validation uses xctrace Metal System Trace (display-surface-swap).

Validated locally on Mac Pro 2019 + Vega II: chat ~5.5x faster on CPU; governed display-on load held the compositor (no watchdog panic) where uncapped load crashed 4x.

On a single-GPU Mac, sustained back-to-back inference (eval suites, bulk
ingest, back-to-back RAG) can monopolize the Metal command queue long enough
that WindowServer misses its kernel watchdog check-ins and the machine panics
and reboots. Interactive use survives because it is bursty; sustained batch
load removes the idle gaps the compositor needs.

This adds a pressure governor that adapts the (previously unwired) admission
scheduler's concurrency ceiling to host signals:

- internal/pressure: a Sensor reading display power state (ioreg
  IODisplayWrangler CurrentPowerState vs MaxPowerState) and memory pressure
  (sysctl kern.memorystatus_vm_pressure_level), and a Limits.Target mapping —
  full throughput when headless or the display is asleep; a reduced ceiling
  while a screen is being driven (reserving GPU headroom for the compositor);
  further backoff on memory warn/critical. No CGo (shells out with timeouts);
  non-darwin builds report headless → full throughput.
- scheduler: SetConcurrency/Concurrency for runtime ceiling adjustment.
- gateway: a gated() middleware that runs the scheduler around every
  GPU-bound route — the single chokepoint covering both the reverse-proxy
  handlers and the Ollama-translation handlers (which forward via their own
  client). Chat is prioritized over batch embed/rerank; saturation returns
  503 + Retry-After. /health surfaces live governor state.
- config: QUENCHFORGE_GOVERNOR / _GPU_CONCURRENCY_MAX /
  _GPU_CONCURRENCY_DISPLAY_ACTIVE / _GOVERNOR_INTERVAL_MS (default on;
  6 headless, 2 display-active).

Tested: scheduler/pressure/gateway/config unit tests green; go build ./... and
all internal+cmd packages pass.
Concurrency capping alone does not prevent display-compositor (WindowServer)
starvation: sustained gapless GPU command buffers monopolize the Metal queue at
any concurrency because macOS AMD has no compositor preemption. The fix is
temporal — force GPU idle windows when a display is being driven.

- scheduler: SetDutyCycle/DutyCycle (governor knob, stored).
- gateway gated(): after each request, hold the slot idle for
  idle = busy*(1-d)/d (capped by GovernorMaxCooldownMS) so the GPU yields a
  window to the compositor before the next admission.
- pressure: Limits.For(reading) -> Plan{Concurrency, Duty}; display-active now
  serializes (conc 1) and applies the duty cycle; memory pressure tightens it;
  headless/asleep -> full throughput, duty 1.0 (no gaps, no regression).
- config: GPUDutyCycleDisplayActive (0.5), GovernorMaxCooldownMS (250);
  display-active concurrency default 2 -> 1.

Tested: duty Set/clamp, For() plan mapping, gated cooldown timing + cap +
no-cooldown-at-duty-1; go build + all internal/cmd packages pass.
Makes device placement a first-class, hardware-adaptive control plane rather
than a hardcoded --gpu-layers in tuning. Pairs with the admission/duty governor:
placement decides WHERE a workload runs; the governor decides HOW MUCH GPU it
gets. Together they are quenchforge's resource-control framework.

- internal/placement: Policy with per-kind modes gpu|cpu|auto, hardware-adaptive
  defaults (AMD-discrete: chat=cpu, embed/code-embed/rerank=gpu; non-AMD: all
  gpu), per-kind operator overrides, and RouteRequest(kind,batchN,threshold) for
  'auto' kinds (single->CPU latency, batched->GPU throughput).
- tuning.KernelParams now consults placement: CPU placement returns minimal CPU
  tuning (--gpu-layers 0, no Metal env, cache left on); GPU placement keeps the
  existing AMD safety tuning. PolicyFor() exposed for the gateway's future
  per-request auto-routing.
- config: QUENCHFORGE_PLACE_{CHAT,EMBED,CODE_EMBED,RERANK}=gpu|cpu|auto overrides.

Default effect on this Mac Pro + Vega II: chat -> CPU (measured ~7x faster than
the AMD Metal path; corroborated by bench-llama-sustained-load p50=29.4s) and
removed from display-compositor GPU contention; embed/code-embed/rerank stay on
GPU (v0.8.0 batched-throughput win). Tests cover defaults, overrides, auto
routing, and the GPU chat path under PlaceChat=gpu.
…nk CPU default

Completes the dynamic half of the placement framework. Embedding kinds can run
'auto': the gateway routes each request by input count — single (latency) to a
CPU instance, batched (throughput) to the GPU instance — getting the best of
both (quenchforge bench: GPU ~2.5x batched; CPU faster single-request).

- placement: AMD-discrete rerank default gpu->cpu (query-time/latency-bound,
  measured faster on CPU single-request; also removes it from GPU contention).
- tuning: KernelParamsForDevice(...,dev) for explicit per-instance device.
- gateway: cpuUpstreams + SetCPUUpstream + SetPlacement; routeEmbed() picks the
  instance by mode+batch; gated() refactored to withGPUAdmission and is now
  placement-aware (CPU-placed kinds skip GPU admission entirely); both embed
  handlers route per request and govern only the GPU branch.
- config: QUENCHFORGE_{EMBED,CODE_EMBED}_CPU_PORT + AUTO_BATCH_THRESHOLD.
- main: dual-launch (GPU primary + CPU instance) for 'auto' embed/code-embed;
  single-instance otherwise. SetPlacement wired from the host policy.

DORMANT by default: with no kind set to 'auto', no CPU instance launches and
routeEmbed falls through to the single upstream — behaviour identical to before.
All unit tests pass; new tests cover routeEmbed, countEmbedInputs, gated
placement-awareness, and KernelParamsForDevice.
@sunrunnerfire sunrunnerfire merged commit eecd39c into main Jun 8, 2026
6 checks passed
@sunrunnerfire sunrunnerfire deleted the feat/gpu-pressure-governor branch June 8, 2026 02:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant