GPU resource-control framework: pressure governor + workload placement by sunrunnerfire · Pull Request #20 · Cerid-AI/quenchforge

sunrunnerfire · 2026-06-08T01:53:25Z

Adds a two-part resource-control framework so sustained inference on a single-GPU Mac can't starve the display compositor (WindowServer) into a kernel-watchdog panic, and so each workload runs on the device that actually serves it best.

Placement (where work runs) — `internal/placement`

Per-kind device policy gpu|cpu|auto, hardware-adaptive defaults, env overrides (QUENCHFORGE_PLACE_{CHAT,EMBED,CODE_EMBED,RERANK}).

On AMD-discrete (Vega II), chat and rerank default to CPU — measured single-request latency is far better on CPU than the AMD Metal path (chat ~4-5s vs ~27s; corroborated by bench-llama-sustained-load p50=29.4s), and it removes them from compositor GPU contention.
embed/code-embed default to GPU (the v0.8.0 batched-throughput win), and can run auto: the gateway routes each request by input count — single→CPU (latency), batched→GPU (throughput) — via a dual GPU+CPU instance.

Governor (how much GPU) — `internal/pressure` + `internal/scheduler` + gateway

Adaptive GPU admission: concurrency cap + duty-cycle idle gaps (the key lever — concurrency capping alone does not prevent starvation; sustained gapless command buffers do, at any concurrency). Driven by a display-power + memory-pressure sensor (full throughput when headless/asleep; reserve compositor headroom when a display is driven). Env: QUENCHFORGE_GOVERNOR, _GPU_CONCURRENCY_{MAX,DISPLAY_ACTIVE}, _GPU_DUTY_DISPLAY_ACTIVE, _GOVERNOR_MAX_COOLDOWN_MS.

Safety / compatibility

Dormant by default: no auto kind → no second instance, single-upstream path unchanged. Governor is a no-op when headless.
Single GPU-admission chokepoint (withGPUAdmission); CPU-placed kinds skip admission entirely.
All unit tests pass; gofmt + vet clean. Notes: Device Utilization % from ioreg is unreliable on this AMD card — validation uses xctrace Metal System Trace (display-surface-swap).

Validated locally on Mac Pro 2019 + Vega II: chat ~5.5x faster on CPU; governed display-on load held the compositor (no watchdog panic) where uncapped load crashed 4x.

On a single-GPU Mac, sustained back-to-back inference (eval suites, bulk ingest, back-to-back RAG) can monopolize the Metal command queue long enough that WindowServer misses its kernel watchdog check-ins and the machine panics and reboots. Interactive use survives because it is bursty; sustained batch load removes the idle gaps the compositor needs. This adds a pressure governor that adapts the (previously unwired) admission scheduler's concurrency ceiling to host signals: - internal/pressure: a Sensor reading display power state (ioreg IODisplayWrangler CurrentPowerState vs MaxPowerState) and memory pressure (sysctl kern.memorystatus_vm_pressure_level), and a Limits.Target mapping — full throughput when headless or the display is asleep; a reduced ceiling while a screen is being driven (reserving GPU headroom for the compositor); further backoff on memory warn/critical. No CGo (shells out with timeouts); non-darwin builds report headless → full throughput. - scheduler: SetConcurrency/Concurrency for runtime ceiling adjustment. - gateway: a gated() middleware that runs the scheduler around every GPU-bound route — the single chokepoint covering both the reverse-proxy handlers and the Ollama-translation handlers (which forward via their own client). Chat is prioritized over batch embed/rerank; saturation returns 503 + Retry-After. /health surfaces live governor state. - config: QUENCHFORGE_GOVERNOR / _GPU_CONCURRENCY_MAX / _GPU_CONCURRENCY_DISPLAY_ACTIVE / _GOVERNOR_INTERVAL_MS (default on; 6 headless, 2 display-active). Tested: scheduler/pressure/gateway/config unit tests green; go build ./... and all internal+cmd packages pass.

Concurrency capping alone does not prevent display-compositor (WindowServer) starvation: sustained gapless GPU command buffers monopolize the Metal queue at any concurrency because macOS AMD has no compositor preemption. The fix is temporal — force GPU idle windows when a display is being driven. - scheduler: SetDutyCycle/DutyCycle (governor knob, stored). - gateway gated(): after each request, hold the slot idle for idle = busy*(1-d)/d (capped by GovernorMaxCooldownMS) so the GPU yields a window to the compositor before the next admission. - pressure: Limits.For(reading) -> Plan{Concurrency, Duty}; display-active now serializes (conc 1) and applies the duty cycle; memory pressure tightens it; headless/asleep -> full throughput, duty 1.0 (no gaps, no regression). - config: GPUDutyCycleDisplayActive (0.5), GovernorMaxCooldownMS (250); display-active concurrency default 2 -> 1. Tested: duty Set/clamp, For() plan mapping, gated cooldown timing + cap + no-cooldown-at-duty-1; go build + all internal/cmd packages pass.

Makes device placement a first-class, hardware-adaptive control plane rather than a hardcoded --gpu-layers in tuning. Pairs with the admission/duty governor: placement decides WHERE a workload runs; the governor decides HOW MUCH GPU it gets. Together they are quenchforge's resource-control framework. - internal/placement: Policy with per-kind modes gpu|cpu|auto, hardware-adaptive defaults (AMD-discrete: chat=cpu, embed/code-embed/rerank=gpu; non-AMD: all gpu), per-kind operator overrides, and RouteRequest(kind,batchN,threshold) for 'auto' kinds (single->CPU latency, batched->GPU throughput). - tuning.KernelParams now consults placement: CPU placement returns minimal CPU tuning (--gpu-layers 0, no Metal env, cache left on); GPU placement keeps the existing AMD safety tuning. PolicyFor() exposed for the gateway's future per-request auto-routing. - config: QUENCHFORGE_PLACE_{CHAT,EMBED,CODE_EMBED,RERANK}=gpu|cpu|auto overrides. Default effect on this Mac Pro + Vega II: chat -> CPU (measured ~7x faster than the AMD Metal path; corroborated by bench-llama-sustained-load p50=29.4s) and removed from display-compositor GPU contention; embed/code-embed/rerank stay on GPU (v0.8.0 batched-throughput win). Tests cover defaults, overrides, auto routing, and the GPU chat path under PlaceChat=gpu.

…nk CPU default Completes the dynamic half of the placement framework. Embedding kinds can run 'auto': the gateway routes each request by input count — single (latency) to a CPU instance, batched (throughput) to the GPU instance — getting the best of both (quenchforge bench: GPU ~2.5x batched; CPU faster single-request). - placement: AMD-discrete rerank default gpu->cpu (query-time/latency-bound, measured faster on CPU single-request; also removes it from GPU contention). - tuning: KernelParamsForDevice(...,dev) for explicit per-instance device. - gateway: cpuUpstreams + SetCPUUpstream + SetPlacement; routeEmbed() picks the instance by mode+batch; gated() refactored to withGPUAdmission and is now placement-aware (CPU-placed kinds skip GPU admission entirely); both embed handlers route per request and govern only the GPU branch. - config: QUENCHFORGE_{EMBED,CODE_EMBED}_CPU_PORT + AUTO_BATCH_THRESHOLD. - main: dual-launch (GPU primary + CPU instance) for 'auto' embed/code-embed; single-instance otherwise. SetPlacement wired from the host policy. DORMANT by default: with no kind set to 'auto', no CPU instance launches and routeEmbed falls through to the single upstream — behaviour identical to before. All unit tests pass; new tests cover routeEmbed, countEmbedInputs, gated placement-awareness, and KernelParamsForDevice.

sunrunnerfire added 4 commits June 7, 2026 19:00

sunrunnerfire merged commit eecd39c into main Jun 8, 2026
6 checks passed

sunrunnerfire deleted the feat/gpu-pressure-governor branch June 8, 2026 02:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU resource-control framework: pressure governor + workload placement#20

GPU resource-control framework: pressure governor + workload placement#20
sunrunnerfire merged 4 commits into
mainfrom
feat/gpu-pressure-governor

sunrunnerfire commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sunrunnerfire commented Jun 8, 2026

Placement (where work runs) — internal/placement

Governor (how much GPU) — internal/pressure + internal/scheduler + gateway

Safety / compatibility

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Placement (where work runs) — `internal/placement`

Governor (how much GPU) — `internal/pressure` + `internal/scheduler` + gateway