TurboFlash attention kernel: 13 GB excess Metal compute buffer allocation (commit 0d6b38aad) #77

@shivam2014

Description

Environment

  • Hardware: Apple M1 Max, 64 GB unified memory
  • OS: macOS 26.3 (25D125)
  • Binary: llama-server (turboquant fork, feature/turboquant-kv-cache)
  • Bad commit: 8590cbff9 (Apr 9, 2026) — includes 0d6b38aad
  • Good commit: a4e8af445 (Apr 7, 2026) — last build before TurboFlash

Summary

Commit 0d6b38aad ("metal: add TurboFlash attention kernel for turbo3 KV cache decode") causes the Metal compute buffers to be over-allocated by ~13 GB on Apple Silicon. The server still runs correctly, but it consumes nearly all 64 GB of unified memory, whereas the stable build leaves ~17 GB unwired.

Root Cause: Metal Compute Buffer Over-Allocation

Both builds load the identical model (Qwopus-MoE-35B-A3B, 21.24 GiB GGUF), identical KV cache (510 MiB, q8_0 K / turbo4 V), and identical graph (3739 nodes, 2 splits). The difference is entirely in Metal compute buffers:

                            BUGGY (8590cbff9)     STABLE (a4e8af445)     DELTA
─────────────────────────────────────────────────────────────────────────────────
sched_reserve MTL0:         8,480.87 MiB          489.00 MiB            +7,991 MiB (17.3×)
alloc_compute_meta MTL0:    5,355.99 MiB          248.10 MiB            +5,107 MiB (21.6×)
─────────────────────────────────────────────────────────────────────────────────
TOTAL EXCESS:                                                           ~13,098 MiB (+13 GB)

sched_reserve time:         813.16 ms             28.52 ms              28.5× slower

Memory Breakdown (from llama-server logs)

BUGGY:  MTL0 | 53084 = 16064 free + (30805 = 21751 model + 572 context + 8480 compute) + 6214 unaccounted
STABLE: MTL0 | 53084 = 29164 free + (22813 = 21751 model + 572 context +  489 compute) + 1106 unaccounted
                         ─────                                            ────            ────
                        -13100 free                                      +7991 compute   +5108 unaccounted

The buggy build leaves only 16 GB free on the Metal device vs 29 GB for the stable build — a ~13 GB gap that is fully accounted for by the inflated sched_reserve and alloc_compute_meta buffers.
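
For a quick side-by-side, the relevant allocation lines can be pulled straight out of the two server logs (the log file names below are placeholders for wherever each build's output was captured):

# Extract the compute-buffer allocation lines from each build's log
grep -E 'sched_reserve|alloc_compute_meta' llama-server-buggy.log
grep -E 'sched_reserve|alloc_compute_meta' llama-server-stable.log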

System-Level Impact

                            BUGGY           STABLE          DELTA
System wired memory:        60 GB           47 GB           (+13 GB)
System free memory:         ~0 GB           12 GB
Projected device usage:     30,407 MiB      22,415 MiB      (+7,992 MiB)
Process RSS:                23,237 MB       23,419 MB       (similar — model dominates)

On a 64 GB M1 Max, the buggy build uses almost all available memory and triggers swap/compression.
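
For reference, the wired/free figures above can be re-measured with vm_stat; this is a rough sketch that reads the page size (16 KiB on Apple Silicon) from the vm_stat header rather than hard-coding it:

# Wired vs free system memory in GB (vm_stat reports page counts; page size is in its header)
vm_stat | awk '/page size of/ {ps=$8}
               /Pages free/ {f=$3}
               /Pages wired down/ {w=$4}
               END {printf "wired: %.1f GB  free: %.1f GB\n", w*ps/1e9, f*ps/1e9}'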

Steps to Reproduce

# Build buggy version
git clone https://github.com/TheTom/llama-cpp-turboquant.git ~/turboquant-test
cd ~/turboquant-test && git checkout 8590cbff9
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_OSX_ARCHITECTURES=arm64
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

# Start server in the background, capturing its log (model loads on first request)
./build/bin/llama-server \
    --models-preset models_turbo3.ini --models-max 1 --mlock \
    --port 8000 --host 127.0.0.1 \
    --cont-batching --batch-size 2048 --ubatch-size 512 \
    --no-context-shift --timeout 1200 > llama-server.log 2>&1 &

# Trigger model load
curl -s http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwopus-MoE-35B-A3B-APEX-I-Quality", "messages": [{"role":"user","content":"hi"}], "max_tokens": 1}'

# Check server log for sched_reserve size — should be ~489 MiB, not ~8480 MiB
grep sched_reserve llama-server.log
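
For the baseline, repeating the same steps on the last good commit should report ~489 MiB:

# Rebuild at the last good commit and re-check
cd ~/turboquant-test
git checkout a4e8af445
cmake --build build --config Release -j$(sysctl -n hw.ncpu)
# restart llama-server, resend the curl request above, then:
grep sched_reserve llama-server.log    # expect ~489 MiB on this commit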

Model Config Used

[Qwopus-MoE-35B-A3B-APEX-I-Quality]
model           = Qwopus-MoE-35B-A3B-APEX-I-Quality.gguf   ; 21.24 GiB, MoE 35B (~3B active)
n-gpu-layers    = 99
flash-attn      = on
ctx-size        = 65536
parallel        = 1
cache-type-k    = q8_0
cache-type-v    = turbo4

Bisection

a4e8af445 2026-04-07 perf: TQ4_1S native kernel 3.5× faster + AMD arch dispatch  ← WORKS (489 MiB compute)
0d6b38aad 2026-04-08 metal: add TurboFlash attention kernel for turbo3 KV cache decode  ← BROKEN (8480 MiB compute)
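
The window can be re-confirmed mechanically with git bisect; since the regression shows up as a log value rather than an exit code, each step is checked by hand:

cd ~/turboquant-test
git bisect start
git bisect bad  8590cbff9
git bisect good a4e8af445
# at each step: rebuild, rerun the repro above, then mark the commit:
#   git bisect good    # sched_reserve ~489 MiB
#   git bisect bad     # sched_reserve ~8480 MiB
git bisect reset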

Suggested Fix Location

The TurboFlash attention kernel's graph planning in ggml-metal.mm (or wherever sched_reserve computes buffer sizes) is likely sizing the compute buffer for the full model/context instead of per-layer, or allocating scratch space that grows with context length when it shouldn't.

Key clue: the alloc_compute_meta buffer also blows up (248 → 5355 MiB), suggesting the issue is in the graph node allocation, not just the scheduler reservation.
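
A cheap way to probe the context-length hypothesis (a sketch; it assumes ctx-size in models_turbo3.ini is the only value changed) is to shrink the context and see whether the excess shrinks with it:

# Drop ctx-size from 64K to 16K, then re-measure; if sched_reserve shrinks by roughly
# the same factor, the TurboFlash scratch is sized for the full context, not per-ubatch
sed -i '' 's/^ctx-size *= *65536/ctx-size        = 16384/' models_turbo3.ini
# restart llama-server, resend the curl request, then:
grep -E 'sched_reserve|alloc_compute_meta' llama-server.log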
