Environment
- Hardware: Apple M1 Max, 64 GB unified memory
- OS: macOS 26.3 (25D125)
- Binary: llama-server (turboquant fork, feature/turboquant-kv-cache)
- Bad commit: 8590cbff9 (Apr 9, 2026) — includes 0d6b38aad
- Good commit: a4e8af445 (Apr 7, 2026) — last build before TurboFlash
Summary
Commit 0d6b38aad ("metal: add TurboFlash attention kernel for turbo3 KV cache decode") causes the Metal compute buffer to be over-allocated by ~13 GB on Apple Silicon. The server runs correctly but consumes nearly all 64 GB of unified memory on a system where the stable build leaves ~17 GB free.
Root Cause: Metal Compute Buffer Over-Allocation
Both builds load the identical model (Qwopus-MoE-35B-A3B, 21.24 GiB GGUF), identical KV cache (510 MiB, q8_0 K / turbo4 V), and identical graph (3739 nodes, 2 splits). The difference is entirely in Metal compute buffers:
                            BUGGY (8590cbff9)   STABLE (a4e8af445)   DELTA
───────────────────────────────────────────────────────────────────────────────────────────────
sched_reserve MTL0:         8,480.87 MiB        489.00 MiB           +7,991 MiB (17.3× stable)
alloc_compute_meta MTL0:    5,355.99 MiB        248.10 MiB           +5,107 MiB (21.6× stable)
───────────────────────────────────────────────────────────────────────────────────────────────
TOTAL EXCESS:                                                        ~13,098 MiB (+13 GB)
sched_reserve time:         813.16 ms           28.52 ms             28.5× slower
Memory Breakdown (from llama-server logs; values in MiB)
BUGGY:  MTL0 | 53084 = 16064 free + (30805 = 21751 model + 572 context + 8480 compute) + 6214 unaccounted
STABLE: MTL0 | 53084 = 29164 free + (22813 = 21751 model + 572 context +  489 compute) + 1106 unaccounted
DELTA:                -13100 free                                       +7991 compute    +5108 unaccounted
The buggy build leaves only 16 GB free on the GPU vs 29 GB free for stable — a 13 GB difference that maps exactly to the inflated sched_reserve + alloc_compute_meta buffers.
System-Level Impact
                          BUGGY        STABLE
System wired memory:      60 GB        47 GB         (+13 GB)
System free memory:       ~0 GB        12 GB
Projected device usage:   30,407 MiB   22,415 MiB    (+7,992 MiB)
Process RSS:              23,237 MB    23,419 MB     (similar — model dominates)
On a 64 GB M1 Max, the buggy build uses almost all available memory and triggers swap/compression.
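For reference, the wired/free and RSS figures above can be captured with stock macOS tools while each build is serving; this is a minimal sketch, not the harness used for the measurements:

# Wired and free memory, in GB (vm_stat reports page counts; the page
# size is printed on its first line — 16384 bytes on Apple Silicon).
vm_stat | awk '
  /page size of/      { pagesize = $8 }
  /^Pages free/       { free  = $3 }
  /^Pages wired down/ { wired = $4 }
  END { printf "wired: %.1f GB   free: %.1f GB\n", wired*pagesize/1e9, free*pagesize/1e9 }'

# Resident set size of the running server, for the "Process RSS" row (ps reports KiB on macOS)
ps -o rss= -p "$(pgrep -n llama-server)" | awk '{printf "RSS: %d MB\n", $1/1024}'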
Steps to Reproduce
# Build buggy version
git clone https://github.com/TheTom/llama-cpp-turboquant.git ~/turboquant-test
cd ~/turboquant-test && git checkout 8590cbff9
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_OSX_ARCHITECTURES=arm64
cmake --build build --config Release -j$(sysctl -n hw.ncpu)
# Start server (loads model on first request)
./build/bin/llama-server \
--models-preset models_turbo3.ini --models-max 1 --mlock \
--port 8000 --host 127.0.0.1 \
--cont-batching --batch-size 2048 --ubatch-size 512 \
--no-context-shift --timeout 1200 > llama-server.log 2>&1 &
# Trigger model load
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwopus-MoE-35B-A3B-APEX-I-Quality", "messages": [{"role":"user","content":"hi"}], "max_tokens": 1}'
# Check server log for sched_reserve size — should be ~489 MiB, not ~8480 MiB
grep sched_reserve llama-server.log
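To put the two builds side by side, one option is to redirect each server's output to its own log and pull the compute-buffer lines from both. A sketch — buggy.log and stable.log are placeholder names, and the grep pattern is taken from the log excerpts above, so adjust it if the fork's wording differs:

for f in buggy.log stable.log; do
  echo "== $f =="
  grep -E 'sched_reserve|alloc_compute_meta' "$f" | tail -n 4
done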
Model Config Used
[Qwopus-MoE-35B-A3B-APEX-I-Quality]
model = Qwopus-MoE-35B-A3B-APEX-I-Quality.gguf ; 21.24 GiB, MoE 35B (~3B active)
n-gpu-layers = 99
flash-attn = on
ctx-size = 65536
parallel = 1
cache-type-k = q8_0
cache-type-v = turbo4
Bisection
a4e8af445 2026-04-07 perf: TQ4_1S native kernel 3.5× faster + AMD arch dispatch ← WORKS (489 MiB compute)
0d6b38aad 2026-04-08 metal: add TurboFlash attention kernel for turbo3 KV cache decode ← BROKEN (8480 MiB compute)
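For anyone re-running the bisection, a sketch of the loop (it assumes every candidate commit builds cleanly and that the same model preset from above is used for each run):

git bisect start 8590cbff9 a4e8af445    # bad commit first, then good
# at each step:
cmake --build build --config Release -j"$(sysctl -n hw.ncpu)"
# start llama-server, trigger a model load as above, then check the log:
#   sched_reserve ~489 MiB   -> git bisect good
#   sched_reserve ~8480 MiB  -> git bisect bad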
Suggested Fix Location
The TurboFlash attention kernel's graph planning in ggml-metal.mm (or wherever sched_reserve computes buffer sizes) is likely sizing the compute buffer for the full model/context instead of per-layer, or allocating scratch space that grows with context length when it shouldn't.
Key clue: the alloc_compute_meta buffer also blows up (248 → 5355 MiB), suggesting the issue is in the graph node allocation, not just the scheduler reservation.
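A reasonable starting point is to locate where those two buffers are sized. Since both names show up as log tags in the breakdown above, grepping the tree for them should land near the allocation code (file extensions below are a guess, not verified against the fork's layout):

grep -rn --include='*.mm' --include='*.cpp' --include='*.c' \
  -e 'sched_reserve' -e 'alloc_compute_meta' .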