TurboFlash attention kernel: 13 GB excess Metal compute buffer allocation (commit 0d6b38aad) #77

@shivam2014

Description

Environment

  • Hardware: Apple M1 Max, 64 GB unified memory
  • OS: macOS 26.3 (25D125)
  • Binary: llama-server (turboquant fork, feature/turboquant-kv-cache)
  • Bad commit: 8590cbff9 (Apr 9, 2026) — includes 0d6b38aad
  • Good commit: a4e8af445 (Apr 7, 2026) — last build before TurboFlash

Summary

Commit 0d6b38aad ("metal: add TurboFlash attention kernel for turbo3 KV cache decode") causes the Metal compute buffers to be over-allocated by ~13 GB on Apple Silicon. The server still runs correctly, but it consumes nearly all 64 GB of unified memory, whereas the stable build leaves ~17 GB unwired.

Root Cause: Metal Compute Buffer Over-Allocation

Both builds load the identical model (Qwopus-MoE-35B-A3B, 21.24 GiB GGUF), identical KV cache (510 MiB, q8_0 K / turbo4 V), and identical graph (3739 nodes, 2 splits). The difference is entirely in Metal compute buffers:

                            BUGGY (8590cbff9)     STABLE (a4e8af445)     DELTA
─────────────────────────────────────────────────────────────────────────────────
sched_reserve MTL0:         8,480.87 MiB          489.00 MiB            +7,991 MiB (17.3×)
alloc_compute_meta MTL0:    5,355.99 MiB          248.10 MiB            +5,107 MiB (21.6×)
─────────────────────────────────────────────────────────────────────────────────
TOTAL EXCESS:                                                           ~13,098 MiB (+13 GB)

sched_reserve time:         813.16 ms             28.52 ms              28.5× slower

Memory Breakdown (from llama-server logs)

BUGGY:  MTL0 | 53084 = 16064 free + (30805 = 21751 model + 572 context + 8480 compute) + 6214 unaccounted
STABLE: MTL0 | 53084 = 29164 free + (22813 = 21751 model + 572 context +  489 compute) + 1106 unaccounted
                         ─────                                            ────            ────
                        -13100 free                                      +7991 compute   +5108 unaccounted

The buggy build leaves only 16 GB free on the Metal device vs 29 GB for the stable build — a ~13 GB gap that is fully accounted for by the inflated sched_reserve and alloc_compute_meta buffers.
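
For a quick side-by-side, the relevant allocation lines can be pulled straight out of the two server logs (the log file names below are placeholders for wherever each build's output was captured):

# Extract the compute-buffer allocation lines from each build's log
grep -E 'sched_reserve|alloc_compute_meta' llama-server-buggy.log
grep -E 'sched_reserve|alloc_compute_meta' llama-server-stable.log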

System-Level Impact

                            BUGGY           STABLE          DELTA
System wired memory:        60 GB           47 GB           (+13 GB)
System free memory:         ~0 GB           12 GB
Projected device usage:     30,407 MiB      22,415 MiB      (+7,992 MiB)
Process RSS:                23,237 MB       23,419 MB       (similar — model dominates)

On a 64 GB M1 Max, the buggy build uses almost all available memory and triggers swap/compression.
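
For reference, the wired/free figures above can be re-measured with vm_stat; this is a rough sketch that reads the page size (16 KiB on Apple Silicon) from the vm_stat header rather than hard-coding it:

# Wired vs free system memory in GB (vm_stat reports page counts; page size is in its header)
vm_stat | awk '/page size of/ {ps=$8}
               /Pages free/ {f=$3}
               /Pages wired down/ {w=$4}
               END {printf "wired: %.1f GB  free: %.1f GB\n", w*ps/1e9, f*ps/1e9}'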

Steps to Reproduce

# Build buggy version
git clone https://github.com/TheTom/llama-cpp-turboquant.git ~/turboquant-test
cd ~/turboquant-test && git checkout 8590cbff9
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_OSX_ARCHITECTURES=arm64
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

# Start server in the background, capturing its log (model loads on first request)
./build/bin/llama-server \
    --models-preset models_turbo3.ini --models-max 1 --mlock \
    --port 8000 --host 127.0.0.1 \
    --cont-batching --batch-size 2048 --ubatch-size 512 \
    --no-context-shift --timeout 1200 > llama-server.log 2>&1 &

# Trigger model load
curl -s http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwopus-MoE-35B-A3B-APEX-I-Quality", "messages": [{"role":"user","content":"hi"}], "max_tokens": 1}'

# Check server log for sched_reserve size — should be ~489 MiB, not ~8480 MiB
grep sched_reserve llama-server.log
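
For the baseline, repeating the same steps on the last good commit should report ~489 MiB:

# Rebuild at the last good commit and re-check
cd ~/turboquant-test
git checkout a4e8af445
cmake --build build --config Release -j$(sysctl -n hw.ncpu)
# restart llama-server, resend the curl request above, then:
grep sched_reserve llama-server.log    # expect ~489 MiB on this commit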

Model Config Used

[Qwopus-MoE-35B-A3B-APEX-I-Quality]
model           = Qwopus-MoE-35B-A3B-APEX-I-Quality.gguf   ; 21.24 GiB, MoE 35B (~3B active)
n-gpu-layers    = 99
flash-attn      = on
ctx-size        = 65536
parallel        = 1
cache-type-k    = q8_0
cache-type-v    = turbo4

Bisection

a4e8af445 2026-04-07 perf: TQ4_1S native kernel 3.5× faster + AMD arch dispatch  ← WORKS (489 MiB compute)
0d6b38aad 2026-04-08 metal: add TurboFlash attention kernel for turbo3 KV cache decode  ← BROKEN (8480 MiB compute)
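
The window can be re-confirmed mechanically with git bisect; since the regression shows up as a log value rather than an exit code, each step is checked by hand:

cd ~/turboquant-test
git bisect start
git bisect bad  8590cbff9
git bisect good a4e8af445
# at each step: rebuild, rerun the repro above, then mark the commit:
#   git bisect good    # sched_reserve ~489 MiB
#   git bisect bad     # sched_reserve ~8480 MiB
git bisect reset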

Suggested Fix Location

The TurboFlash attention kernel's graph planning in ggml-metal.mm (or wherever sched_reserve computes buffer sizes) is likely sizing the compute buffer for the full model/context instead of per-layer, or allocating scratch space that grows with context length when it shouldn't.

Key clue: the alloc_compute_meta buffer also blows up (248 → 5355 MiB), suggesting the issue is in the graph node allocation, not just the scheduler reservation.
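
A cheap way to probe the context-length hypothesis (a sketch; it assumes ctx-size in models_turbo3.ini is the only value changed) is to shrink the context and see whether the excess shrinks with it:

# Drop ctx-size from 64K to 16K, then re-measure; if sched_reserve shrinks by roughly
# the same factor, the TurboFlash scratch is sized for the full context, not per-ubatch
sed -i '' 's/^ctx-size *= *65536/ctx-size        = 16384/' models_turbo3.ini
# restart llama-server, resend the curl request, then:
grep -E 'sched_reserve|alloc_compute_meta' llama-server.log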
