Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 46 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,52 @@ patch bumps fix bugs or polish without behaviour change.

---

## v0.8.0 — AMD-discrete GPU mode + VRAM-tier-adaptive sizing (2026-05-31)

Promotes the `v0.8.0-rc2` AMD-discrete GPU-mode revival (below) to a
final release and adds **VRAM-tier-adaptive slot sizing** so the full
range of Intel-Mac AMD GPUs — not just the 32 GB Vega II — runs
out-of-the-box without operator hand-tuning.

### VRAM-tier-adaptive sizing

Prior to this release every AMD-discrete profile inherited the Vega II
bench constants (`--ctx-size 8192`, embed `--ubatch-size 1024`)
regardless of card. On a smaller card (8 GB RX 5700, 4 GB MacBook Pro
dGPU) those defaults could oversubscribe VRAM, forcing the operator to
discover and set `QUENCHFORGE_MAX_CONTEXT` / `QUENCHFORGE_EMBED_UBATCH_SIZE`
by hand. `internal/tuning/tuning.go::amdSizing` now derives both from the
detected headline VRAM (`hardware.Info.GPUVRAMGB`, threaded into
`KernelParams`):

| VRAM | `--ctx-size` cap | embed `--ubatch-size` | example cards |
|---|---|---|---|
| ≥ 12 GB | none (keeps `MaxContext`) | 1024 | Vega II/Duo, W6800X, W6900X, Vega 56/64, 5600M |
| 7–11 GB | 4096 | 512 | RX 5700 / 5700 XT, W5700X |
| ≤ 6 GB | 2048 | 256 | 4 GB MacBook Pro dGPUs (5300M/5500M), Polaris 560X |

Design guarantees:

- **Zero regression on the validated path.** The ≥ 12 GB tier (and any
VRAM probe that returns 0/unknown) keeps the exact Vega-II-benched
values, so the canonical Mac Pro config is byte-for-byte unchanged.
- **Caps only ever lower.** `buildSlotArgs` applies the context ceiling
as `min(cfg.MaxContext, cap)`, so an operator who raised
`QUENCHFORGE_MAX_CONTEXT` on a big card is never clamped.
- **Operator overrides still win.** An explicit
`QUENCHFORGE_EMBED_UBATCH_SIZE` beats the tier ubatch; the context cap
is an independent safety knob.
- The fix is family-agnostic: unlisted/future AMD cards fall through
`classifyProfile` to `vega-pro` and are sized by VRAM like any other.

New coverage: `TestAmdSizing_Tiers`, `TestKernelParams_EmbedLowVRAMScalesDown`,
`TestKernelParams_ContextCapAppliesToAllAMDSlots`,
`TestKernelParams_HighVRAMAndNonAMDHaveNoContextCap`,
`TestKernelParams_UbatchOverrideBeatsTierButCapStands`, and
`TestBuildSlotArgs_LowVRAMAMDCapsContextAndUbatch`.

---

## v0.8.0-rc2 — AMD-discrete GPU mode revival (2026-05-25)

The Mac Pro 7,1 + Radeon Pro Vega II 32 GB configuration now runs all
Expand Down
18 changes: 12 additions & 6 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -145,18 +145,24 @@ that copy require a maintainer review.
2. **Embed/rerank slots have their own AMD safety surface — section 3.**
The family-B graph-compute buffer-corruption crash hits embed/rerank
under sustained batch load (eval suites, bulk KB ingest, sustained
MCP retrieval). Operators on AMD discrete running those workloads
should set:
MCP retrieval). As of **v0.8.0 the embed ubatch + context ceiling are
VRAM-tier-adaptive** (`internal/tuning/tuning.go::amdSizing`): the
detected `GPUVRAMGB` picks 1024/none (≥ 12 GB), 512/4096 (8 GB), or
256/2048 (4 GB) automatically, so small cards no longer need
hand-tuning. Operators only set the env vars below to *override* the
tier (force a value, or apply the same safety knobs on a profile the
detector classified as non-AMD):

```
QUENCHFORGE_EMBED_UBATCH_SIZE=1024 # or smaller — caps Metal staging-buffer pressure
QUENCHFORGE_EMBED_UBATCH_SIZE=1024 # override tier ubatch — caps Metal staging-buffer pressure
QUENCHFORGE_EMBED_METAL_N_CB=1 # serialise command-buffer submission
QUENCHFORGE_AUTO_BACKOFF=true # auto-503 before SIGABRT
```

Defaults preserve historical behaviour. `quenchforge-bench
sustained-embed` is the empirical tuning tool; the follow-up PR
will land bench-driven Vega-II defaults.
The ≥ 12 GB tier keeps the bench-validated Vega II values verbatim;
a VRAM probe miss (0/unknown) is treated as high tier so detection
failures never throttle the validated path. `quenchforge-bench
sustained-embed` remains the empirical tuning tool for new families.

3. **`internal/tuning/` is the sole owner of per-(profile, kind) slot
tuning.** `cmd/quenchforge/main.go::buildSlotArgs` and `slotEnv`
Expand Down
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -176,9 +176,9 @@ All settings have sensible defaults. Selected env vars:
| `QUENCHFORGE_MODELS_DIR` | `~/.quenchforge/models` | Where Quenchforge looks for GGUFs |
| `QUENCHFORGE_LOG_DIR` | `~/Library/Logs/quenchforge` | Per-slot log files land here |
| `QUENCHFORGE_PID_DIR` | `~/.config/quenchforge/pids` | Orphan-reaper pidfile dir |
| `QUENCHFORGE_MAX_CONTEXT` | `8192` | `--ctx-size` passed to every slot |
| `QUENCHFORGE_MAX_CONTEXT` | `8192` | `--ctx-size` passed to every slot. On AMD-discrete cards ≤ 11 GB this is auto-capped by VRAM tier (4096 on 8 GB, 2048 on 4 GB) so the KV cache fits; the cap only lowers, never raises, your value. ≥ 12 GB cards use it verbatim. |
| `QUENCHFORGE_METAL_N_CB` | `2` | Metal command-buffer count (`GGML_METAL_N_CB`); global default — per-slot overrides below |
| `QUENCHFORGE_EMBED_UBATCH_SIZE` | `0` (inherit MaxContext) | Per-call `--batch-size` / `--ubatch-size` for embed and code-embed slots. On AMD discrete, lowering this (e.g. `1024`) caps Metal staging-buffer pressure and prevents the family-B sustained-load SIGABRT documented in `patches/README.md` section 3. |
| `QUENCHFORGE_EMBED_UBATCH_SIZE` | `0` (auto) | Per-call `--batch-size` / `--ubatch-size` for embed and code-embed slots. Zero auto-sizes by VRAM tier on AMD discrete (1024 on ≥ 12 GB, 512 on 8 GB, 256 on 4 GB) to cap Metal staging-buffer pressure and prevent the family-B sustained-load SIGABRT (`patches/README.md` section 3); non-AMD inherits MaxContext. An explicit value overrides the tier. |
| `QUENCHFORGE_EMBED_METAL_N_CB` | `0` (inherit `METAL_N_CB`) | Per-slot `GGML_METAL_N_CB` for embed and code-embed. Set to `1` on AMD discrete to serialise Metal command-buffer submission. |
| `QUENCHFORGE_RERANK_BATCH_SIZE` | `0` (llama.cpp's 512-token default) | Rerank slot `--batch-size` and `--ubatch-size`. Raise this when the reranker takes (query, doc) pairs longer than 510 tokens (e.g. `bge-reranker-v2-m3` with ≥ 1k-token chunks). |
| `QUENCHFORGE_RERANK_METAL_N_CB` | `0` (inherit `METAL_N_CB`) | Per-slot `GGML_METAL_N_CB` for the rerank slot. |
Expand Down
16 changes: 12 additions & 4 deletions cmd/quenchforge/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -908,15 +908,23 @@ type slotSpec struct {
// only for the base arg shape plus the layering of the tuning result.
// Move the per-profile decisions there when they change, not here.
func buildSlotArgs(cfg config.Config, hwInfo hardware.Info, spec slotSpec, modelPath string) []string {
tn := tuning.KernelParams(hwInfo.Profile, hwInfo.GPUVRAMGB, spec.Kind, cfg)

// VRAM-tier-adaptive context ceiling: ContextSize only ever lowers
// cfg.MaxContext (small AMD cards), never raises it.
ctxSize := cfg.MaxContext
if tn.ContextSize > 0 && tn.ContextSize < ctxSize {
ctxSize = tn.ContextSize
}

args := []string{
"--model", modelPath,
"--host", "127.0.0.1",
"--port", fmt.Sprintf("%d", spec.Port),
"--ctx-size", fmt.Sprintf("%d", cfg.MaxContext),
"--ctx-size", fmt.Sprintf("%d", ctxSize),
}
args = append(args, spec.ExtraArgs...)

tn := tuning.KernelParams(hwInfo.Profile, spec.Kind, cfg)
if tn.BatchSize > 0 {
args = append(args, "--batch-size", fmt.Sprintf("%d", tn.BatchSize))
}
Expand All @@ -936,7 +944,7 @@ func buildSlotArgs(cfg config.Config, hwInfo hardware.Info, spec slotSpec, model
// on AMD discrete).
func slotEnv(cfg config.Config, hwInfo hardware.Info, kind gateway.SlotKind) []string {
ncb := cfg.MetalNCB
tn := tuning.KernelParams(hwInfo.Profile, kind, cfg)
tn := tuning.KernelParams(hwInfo.Profile, hwInfo.GPUVRAMGB, kind, cfg)
if tn.MetalNCB > 0 {
ncb = tn.MetalNCB
}
Expand Down Expand Up @@ -1022,7 +1030,7 @@ func startSlot(ctx context.Context, cfg config.Config, hwInfo hardware.Info, spe
// graph-compute buffer-corruption crash is non-deterministic and the
// slot stays dead after SIGABRT until manual restart. Tuning module
// owns the decision; we just translate AutoRespawn → RestartPolicy.
tn := tuning.KernelParams(hwInfo.Profile, spec.Kind, cfg)
tn := tuning.KernelParams(hwInfo.Profile, hwInfo.GPUVRAMGB, spec.Kind, cfg)
if tn.AutoRespawn {
slot.RestartPolicy = supervisor.PolicyExpBackoff
}
Expand Down
29 changes: 29 additions & 0 deletions cmd/quenchforge/serve_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -343,6 +343,35 @@ func TestBuildSlotArgs_EmbedKindsBatchOverride(t *testing.T) {
}
}

func TestBuildSlotArgs_LowVRAMAMDCapsContextAndUbatch(t *testing.T) {
// v0.8.0 adaptive sizing: an 8 GB AMD card (e.g. RX 5700) must get a
// capped --ctx-size 4096 (down from MaxContext 8192) and --ubatch-size
// 512 on embed/chat without any operator env var — the gap that used
// to force manual QUENCHFORGE_MAX_CONTEXT / _EMBED_UBATCH_SIZE tuning.
cfg := config.Config{MaxContext: 8192}
info := hardware.Info{Profile: hardware.ProfileRDNA1, GPUVRAMGB: 8}

embedArgs := buildSlotArgs(cfg, info, slotSpec{Kind: gateway.KindEmbed, Name: "embed", Port: 11501}, "/tmp/e.gguf")
if !containsArgPair(embedArgs, "--ctx-size", "4096") {
t.Errorf("8 GB AMD embed missing capped --ctx-size 4096: %v", embedArgs)
}
if !containsArgPair(embedArgs, "--ubatch-size", "512") {
t.Errorf("8 GB AMD embed missing scaled --ubatch-size 512: %v", embedArgs)
}

chatArgs := buildSlotArgs(cfg, info, slotSpec{Kind: gateway.KindChat, Name: "chat", Port: 11500}, "/tmp/c.gguf")
if !containsArgPair(chatArgs, "--ctx-size", "4096") {
t.Errorf("8 GB AMD chat missing capped --ctx-size 4096: %v", chatArgs)
}

// Regression: a 32 GB card keeps the full configured context.
hi := hardware.Info{Profile: hardware.ProfileVegaPro, GPUVRAMGB: 32}
hiArgs := buildSlotArgs(cfg, hi, slotSpec{Kind: gateway.KindChat, Name: "chat", Port: 11500}, "/tmp/c.gguf")
if !containsArgPair(hiArgs, "--ctx-size", "8192") {
t.Errorf("32 GB AMD chat should keep --ctx-size 8192 (no cap): %v", hiArgs)
}
}

func TestBuildSlotArgs_NonEmbedKindsSkipBatchOverride(t *testing.T) {
// Chat / rerank / whisper slots don't need the embed batch override
// (they decode autoregressively or operate per-pair). Adding it
Expand Down
81 changes: 65 additions & 16 deletions internal/tuning/tuning.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,9 +16,11 @@
// kind on this hardware profile want", with table-driven tests.
//
// The function is intentionally pure (no I/O, no globals): it consumes
// a profile, a slot kind, and a config snapshot, and returns a
// `SlotTuning` describing the additional llama-server flags and env
// vars the supervisor should layer on top of the base argv.
// a profile, the detected GPU VRAM (GB), a slot kind, and a config
// snapshot, and returns a `SlotTuning` describing the additional
// llama-server flags and env vars the supervisor should layer on top of
// the base argv. VRAM drives the adaptive context/ubatch sizing (see
// amdSizing) so smaller AMD cards fit without operator hand-tuning.
//
// Honors operator overrides: env-driven config fields (cfg.EmbedUbatchSize,
// cfg.EmbedMetalNCB, cfg.RerankBatchSize, cfg.RerankMetalNCB) win over
Expand Down Expand Up @@ -93,6 +95,14 @@ type SlotTuning struct {
// models and chat-decode races. See llama.cpp issue #19563 and patch 0002.
// Apple Silicon (UMA) does not need this; the concurrent path is correct there.
MetalConcurrencyDisable bool

// ContextSize, when non-zero, is a VRAM-tier-derived ceiling on the
// slot's --ctx-size. buildSlotArgs applies it as min(cfg.MaxContext,
// ContextSize), so it only ever LOWERS the configured context: small
// AMD cards (<= 11 GB) get a KV cache that fits without manual tuning,
// while >= 12 GB cards and non-AMD profiles leave this 0 (no cap) and
// keep cfg.MaxContext verbatim. See amdSizing.
ContextSize int
}

// KernelParams returns the tuning the supervisor should apply for the
Expand All @@ -108,14 +118,14 @@ type SlotTuning struct {
// `~/Develop/quenchforge/llama.cpp/ggml/src/ggml-metal/ggml-metal-device.m:1665-1717`
// — the `buf->is_shared` fast path uses plain `memcpy`). Adding flags
// on those profiles would regress throughput without any safety win.
func KernelParams(profile hardware.Profile, kind gateway.SlotKind, cfg config.Config) SlotTuning {
func KernelParams(profile hardware.Profile, vramGB int, kind gateway.SlotKind, cfg config.Config) SlotTuning {
switch kind {
case gateway.KindChat:
return chatParams(profile)
return chatParams(profile, vramGB)
case gateway.KindEmbed, gateway.KindCodeEmbed:
return embedParams(profile, cfg)
return embedParams(profile, vramGB, cfg)
case gateway.KindRerank:
return rerankParams(profile, cfg)
return rerankParams(profile, vramGB, cfg)
}
// Whisper / imagegen and any future kinds fall through unchanged.
return SlotTuning{}
Expand All @@ -129,10 +139,11 @@ func KernelParams(profile hardware.Profile, kind gateway.SlotKind, cfg config.Co
// theory that chat is naturally bursty; cerid eval workloads broke
// that assumption (chat.log entry at 2026-05-16T23:14 — task 143
// hit GGML_ASSERT at `set_tensor` after ~30 successful chat calls).
func chatParams(profile hardware.Profile) SlotTuning {
func chatParams(profile hardware.Profile, vramGB int) SlotTuning {
if !profileIsAMDDiscrete(profile) {
return SlotTuning{}
}
ctxCap, _ := amdSizing(vramGB)
// AMD-discrete chat slot runs on GPU as of v0.8.0. The MTLDispatchTypeConcurrent
// race that produced cross-call non-determinism is disabled via
// MetalConcurrencyDisable -> GGML_METAL_CONCURRENCY_DISABLE=1. The family-B
Expand All @@ -150,6 +161,7 @@ func chatParams(profile hardware.Profile) SlotTuning {
"--no-cache-prompt",
"--gpu-layers", "999",
},
ContextSize: ctxCap,
MetalConcurrencyDisable: true,
AutoRespawn: true,
}
Expand All @@ -174,23 +186,28 @@ func chatParams(profile hardware.Profile) SlotTuning {
// - AMD discrete profiles additionally enable AutoRespawn — the
// supervisor brings the slot back on a Metal SIGABRT instead of
// leaving it dead until manual restart.
func embedParams(profile hardware.Profile, cfg config.Config) SlotTuning {
func embedParams(profile hardware.Profile, vramGB int, cfg config.Config) SlotTuning {
ubatch := cfg.MaxContext
metalNCB := cfg.MetalNCB
ctxCap := 0
if profileIsAMDDiscrete(profile) {
// AMD-discrete on GPU (v0.8.0) needs the 1024 ubatch cap re-enabled —
// AMD-discrete on GPU (v0.8.0) needs the ubatch cap re-enabled —
// it bounds per-call Metal staging-buffer pressure even with patch 0002's
// pool in place. CLAUDE.md operational gotcha #2 documents this knob.
ubatch = amdEmbedUbatchDefault
// pool in place. As of v0.8.0 the cap is VRAM-tier-adaptive (1024 on
// >= 12 GB cards, 512 on 8 GB, 256 on 4 GB) so smaller cards don't OOM
// without an operator setting QUENCHFORGE_EMBED_UBATCH_SIZE by hand.
// CLAUDE.md operational gotcha #2 documents this knob.
ctxCap, ubatch = amdSizing(vramGB)
metalNCB = amdEmbedMetalNCBDefault
}
if cfg.EmbedUbatchSize > 0 {
ubatch = cfg.EmbedUbatchSize
}
t := SlotTuning{
UbatchSize: ubatch,
BatchSize: ubatch,
MetalNCB: metalNCB,
UbatchSize: ubatch,
BatchSize: ubatch,
MetalNCB: metalNCB,
ContextSize: ctxCap,
}
if cfg.EmbedMetalNCB > 0 {
t.MetalNCB = cfg.EmbedMetalNCB
Expand Down Expand Up @@ -224,13 +241,15 @@ func embedParams(profile hardware.Profile, cfg config.Config) SlotTuning {
//
// AutoRespawn fires on AMD discrete same as embed. AMD profiles also
// get the conservative MetalNCB=1 default same as embed.
func rerankParams(profile hardware.Profile, cfg config.Config) SlotTuning {
func rerankParams(profile hardware.Profile, vramGB int, cfg config.Config) SlotTuning {
t := SlotTuning{}
if cfg.RerankBatchSize > 0 {
t.BatchSize = cfg.RerankBatchSize
t.UbatchSize = cfg.RerankBatchSize
}
if profileIsAMDDiscrete(profile) {
ctxCap, _ := amdSizing(vramGB)
t.ContextSize = ctxCap
t.MetalNCB = amdEmbedMetalNCBDefault
}
if cfg.RerankMetalNCB > 0 {
Expand All @@ -251,6 +270,36 @@ func rerankParams(profile hardware.Profile, cfg config.Config) SlotTuning {
return t
}

// amdSizing returns the VRAM-tier-adaptive context ceiling and embed
// ubatch for an AMD-discrete profile. A contextCap of 0 means "no cap —
// honour cfg.MaxContext as-is".
//
// The high tier (>= 12 GB: Vega II/Duo, W6800X, W6900X, Vega 56/64,
// 5600M) keeps the Vega-II-validated values: no context cap, ubatch
// 1024. Smaller cards scale both down so the KV cache + Metal staging
// buffers fit without an operator hand-tuning QUENCHFORGE_MAX_CONTEXT /
// QUENCHFORGE_EMBED_UBATCH_SIZE:
//
// VRAM context cap embed ubatch example cards
// >= 12 GB none (0) 1024 Vega II/Duo, W6800X, W6900X, Vega 56/64, 5600M
// 7-11 GB 4096 512 RX 5700 / 5700 XT, W5700X
// <= 6 GB 2048 256 4 GB MacBook Pro dGPUs (5300M/5500M), Polaris 560X
//
// vramGB <= 0 means detection could not read VRAM; treat it as the high
// tier so a probe miss never throttles the validated Vega II path. The
// caps only ever LOWER cfg.MaxContext (buildSlotArgs takes the min), so a
// high-VRAM operator who raised QUENCHFORGE_MAX_CONTEXT is unaffected.
func amdSizing(vramGB int) (contextCap, ubatch int) {
switch {
case vramGB <= 0 || vramGB >= 12:
return 0, amdEmbedUbatchDefault
case vramGB >= 7:
return 4096, 512
default:
return 2048, 256
}
}

// profileIsAMDDiscrete inlines hardware.Info.IsAMDDiscrete logic
// against a raw Profile (we don't have a full Info here). Kept in sync
// with internal/hardware/hardware.go::IsAMDDiscrete by the test in
Expand Down
Loading
Loading