GPU resource-control framework: pressure governor + workload placement#20
Merged
Conversation
On a single-GPU Mac, sustained back-to-back inference (eval suites, bulk ingest, back-to-back RAG) can monopolize the Metal command queue long enough that WindowServer misses its kernel watchdog check-ins and the machine panics and reboots. Interactive use survives because it is bursty; sustained batch load removes the idle gaps the compositor needs. This adds a pressure governor that adapts the (previously unwired) admission scheduler's concurrency ceiling to host signals: - internal/pressure: a Sensor reading display power state (ioreg IODisplayWrangler CurrentPowerState vs MaxPowerState) and memory pressure (sysctl kern.memorystatus_vm_pressure_level), and a Limits.Target mapping — full throughput when headless or the display is asleep; a reduced ceiling while a screen is being driven (reserving GPU headroom for the compositor); further backoff on memory warn/critical. No CGo (shells out with timeouts); non-darwin builds report headless → full throughput. - scheduler: SetConcurrency/Concurrency for runtime ceiling adjustment. - gateway: a gated() middleware that runs the scheduler around every GPU-bound route — the single chokepoint covering both the reverse-proxy handlers and the Ollama-translation handlers (which forward via their own client). Chat is prioritized over batch embed/rerank; saturation returns 503 + Retry-After. /health surfaces live governor state. - config: QUENCHFORGE_GOVERNOR / _GPU_CONCURRENCY_MAX / _GPU_CONCURRENCY_DISPLAY_ACTIVE / _GOVERNOR_INTERVAL_MS (default on; 6 headless, 2 display-active). Tested: scheduler/pressure/gateway/config unit tests green; go build ./... and all internal+cmd packages pass.
Concurrency capping alone does not prevent display-compositor (WindowServer)
starvation: sustained gapless GPU command buffers monopolize the Metal queue at
any concurrency because macOS AMD has no compositor preemption. The fix is
temporal — force GPU idle windows when a display is being driven.
- scheduler: SetDutyCycle/DutyCycle (governor knob, stored).
- gateway gated(): after each request, hold the slot idle for
idle = busy*(1-d)/d (capped by GovernorMaxCooldownMS) so the GPU yields a
window to the compositor before the next admission.
- pressure: Limits.For(reading) -> Plan{Concurrency, Duty}; display-active now
serializes (conc 1) and applies the duty cycle; memory pressure tightens it;
headless/asleep -> full throughput, duty 1.0 (no gaps, no regression).
- config: GPUDutyCycleDisplayActive (0.5), GovernorMaxCooldownMS (250);
display-active concurrency default 2 -> 1.
Tested: duty Set/clamp, For() plan mapping, gated cooldown timing + cap +
no-cooldown-at-duty-1; go build + all internal/cmd packages pass.
Makes device placement a first-class, hardware-adaptive control plane rather
than a hardcoded --gpu-layers in tuning. Pairs with the admission/duty governor:
placement decides WHERE a workload runs; the governor decides HOW MUCH GPU it
gets. Together they are quenchforge's resource-control framework.
- internal/placement: Policy with per-kind modes gpu|cpu|auto, hardware-adaptive
defaults (AMD-discrete: chat=cpu, embed/code-embed/rerank=gpu; non-AMD: all
gpu), per-kind operator overrides, and RouteRequest(kind,batchN,threshold) for
'auto' kinds (single->CPU latency, batched->GPU throughput).
- tuning.KernelParams now consults placement: CPU placement returns minimal CPU
tuning (--gpu-layers 0, no Metal env, cache left on); GPU placement keeps the
existing AMD safety tuning. PolicyFor() exposed for the gateway's future
per-request auto-routing.
- config: QUENCHFORGE_PLACE_{CHAT,EMBED,CODE_EMBED,RERANK}=gpu|cpu|auto overrides.
Default effect on this Mac Pro + Vega II: chat -> CPU (measured ~7x faster than
the AMD Metal path; corroborated by bench-llama-sustained-load p50=29.4s) and
removed from display-compositor GPU contention; embed/code-embed/rerank stay on
GPU (v0.8.0 batched-throughput win). Tests cover defaults, overrides, auto
routing, and the GPU chat path under PlaceChat=gpu.
…nk CPU default
Completes the dynamic half of the placement framework. Embedding kinds can run
'auto': the gateway routes each request by input count — single (latency) to a
CPU instance, batched (throughput) to the GPU instance — getting the best of
both (quenchforge bench: GPU ~2.5x batched; CPU faster single-request).
- placement: AMD-discrete rerank default gpu->cpu (query-time/latency-bound,
measured faster on CPU single-request; also removes it from GPU contention).
- tuning: KernelParamsForDevice(...,dev) for explicit per-instance device.
- gateway: cpuUpstreams + SetCPUUpstream + SetPlacement; routeEmbed() picks the
instance by mode+batch; gated() refactored to withGPUAdmission and is now
placement-aware (CPU-placed kinds skip GPU admission entirely); both embed
handlers route per request and govern only the GPU branch.
- config: QUENCHFORGE_{EMBED,CODE_EMBED}_CPU_PORT + AUTO_BATCH_THRESHOLD.
- main: dual-launch (GPU primary + CPU instance) for 'auto' embed/code-embed;
single-instance otherwise. SetPlacement wired from the host policy.
DORMANT by default: with no kind set to 'auto', no CPU instance launches and
routeEmbed falls through to the single upstream — behaviour identical to before.
All unit tests pass; new tests cover routeEmbed, countEmbedInputs, gated
placement-awareness, and KernelParamsForDevice.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds a two-part resource-control framework so sustained inference on a single-GPU Mac can't starve the display compositor (WindowServer) into a kernel-watchdog panic, and so each workload runs on the device that actually serves it best.
Placement (where work runs) —
internal/placementPer-kind device policy
gpu|cpu|auto, hardware-adaptive defaults, env overrides (QUENCHFORGE_PLACE_{CHAT,EMBED,CODE_EMBED,RERANK}).bench-llama-sustained-loadp50=29.4s), and it removes them from compositor GPU contention.auto: the gateway routes each request by input count — single→CPU (latency), batched→GPU (throughput) — via a dual GPU+CPU instance.Governor (how much GPU) —
internal/pressure+internal/scheduler+ gatewayAdaptive GPU admission: concurrency cap + duty-cycle idle gaps (the key lever — concurrency capping alone does not prevent starvation; sustained gapless command buffers do, at any concurrency). Driven by a display-power + memory-pressure sensor (full throughput when headless/asleep; reserve compositor headroom when a display is driven). Env:
QUENCHFORGE_GOVERNOR,_GPU_CONCURRENCY_{MAX,DISPLAY_ACTIVE},_GPU_DUTY_DISPLAY_ACTIVE,_GOVERNOR_MAX_COOLDOWN_MS.Safety / compatibility
autokind → no second instance, single-upstream path unchanged. Governor is a no-op when headless.withGPUAdmission); CPU-placed kinds skip admission entirely.Device Utilization %from ioreg is unreliable on this AMD card — validation uses xctrace Metal System Trace (display-surface-swap).Validated locally on Mac Pro 2019 + Vega II: chat ~5.5x faster on CPU; governed display-on load held the compositor (no watchdog panic) where uncapped load crashed 4x.