Investigate TurboQuant for KV cache compression (llama.cpp + MLX) #115

@TimPietruskyRunPod

Context

TurboQuant is a Google Research paper (ICLR 2026) that applies random orthogonal rotation (Walsh-Hadamard Transform) before quantizing KV cache vectors. It's data-oblivious (no calibration needed) and achieves near-optimal compression with minimal quality loss.
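To make the rotation idea concrete, here is a minimal pure-Python sketch. It is not the paper's or llama.cpp's implementation. It assumes a randomized Walsh-Hadamard rotation (random sign flips followed by a normalized WHT) and a simplified symmetric 4-bit block quantizer with one absmax scale per 32 values, and all function names are made up for illustration. It compares reconstruction error with and without rotation on vectors that have one outlier channel.

```python
import math
import random


def fwht(v):
    """Normalized fast Walsh-Hadamard transform; len(v) must be a power of two.

    With the 1/sqrt(n) normalization the transform is orthogonal and is its
    own inverse.
    """
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [val / math.sqrt(n) for val in out]


def rotate(v, signs):
    """Random sign flips followed by the WHT (a random orthogonal rotation)."""
    return fwht([s * val for s, val in zip(signs, v)])


def unrotate(v, signs):
    """Inverse of rotate(): undo the WHT, then undo the sign flips."""
    return [s * val for s, val in zip(signs, fwht(v))]


def fake_quant_4bit(v, block=32):
    """Simplified symmetric 4-bit quantize/dequantize, one scale per block.

    This is an illustrative stand-in, not the exact ggml q4_0 layout.
    """
    out = []
    for start in range(0, len(v), block):
        chunk = v[start:start + block]
        scale = max(abs(val) for val in chunk) / 7.0 or 1.0
        out.extend(round(val / scale) * scale for val in chunk)
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
dim = 128
signs = [random.choice((-1.0, 1.0)) for _ in range(dim)]

err_plain = 0.0
err_rotated = 0.0
num_vectors = 64
for _ in range(num_vectors):
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    v[5] += 50.0  # one outlier channel, the case rotation is meant to help
    err_plain += mse(fake_quant_4bit(v), v) / num_vectors
    restored = unrotate(fake_quant_4bit(rotate(v, signs)), signs)
    err_rotated += mse(restored, v) / num_vectors

print(f"MSE without rotation: {err_plain:.4f}")
print(f"MSE with rotation:    {err_rotated:.4f}")
```

In this toy setup the rotation spreads the outlier's energy across all coordinates, so no single block is forced onto a coarse scale. Real KV cache statistics will differ, so this only illustrates the mechanism and does not predict perplexity impact.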

llama.cpp — Core rotation already merged

The core rotation idea is merged in ggml-org/llama.cpp#21038 (April 1, 2026) and is enabled by default in current llama.cpp master. It improves all existing KV cache quant types automatically.

We currently use -ctk q8_0 -ctv q8_0. With rotation enabled, we could potentially drop to q4_0 and still maintain near-lossless quality:

| Cache type | Compression vs f16 | PPL impact |
| --- | --- | --- |
| q8_0 (current) | ~1.9x | negligible |
| q4_0 + rotation | ~3.6x | negligible |
| tbq3_0 (proposed, not merged) | ~5.2x | slight degradation |

Going from q8_0 → q4_0 with rotation would roughly halve KV cache VRAM, allowing significantly more context length on the same hardware. All kvCacheMbPer1kTokens values would need recalibration.
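As a rough first estimate before real recalibration, the compression ratios above can be used to rescale an existing `kvCacheMbPer1kTokens` value. The sketch below is only arithmetic on the ratios listed in this issue. The helper names, the 64 MB per 1k tokens figure, and the 8192 MB budget are hypothetical placeholders, not values from our config.

```python
# Approximate compression ratios vs f16, copied from the table in this issue.
COMPRESSION_VS_F16 = {"f16": 1.0, "q8_0": 1.9, "q4_0": 3.6, "tbq3_0": 5.2}


def rescale_kv_mb_per_1k(current_mb, from_type, to_type):
    """Estimate MB per 1k tokens after switching KV cache type."""
    return current_mb * COMPRESSION_VS_F16[from_type] / COMPRESSION_VS_F16[to_type]


def max_context_tokens(kv_budget_mb, mb_per_1k):
    """Tokens of context that fit in a given KV cache budget."""
    return int(kv_budget_mb / mb_per_1k * 1000)


# Hypothetical example values, for illustration only.
q8_mb_per_1k = 64.0
kv_budget_mb = 8192.0

q4_mb_per_1k = rescale_kv_mb_per_1k(q8_mb_per_1k, "q8_0", "q4_0")
print(f"q8_0: {q8_mb_per_1k:.1f} MB/1k tokens, "
      f"{max_context_tokens(kv_budget_mb, q8_mb_per_1k)} tokens")
print(f"q4_0: {q4_mb_per_1k:.1f} MB/1k tokens, "
      f"{max_context_tokens(kv_budget_mb, q4_mb_per_1k)} tokens")
```

The ratio 3.6 / 1.9 is about 1.9, which is where "roughly halve" comes from. Measured values should still replace these estimates, since the ratios are approximate.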

Caveat: Known interaction with weight quantization — some architectures (notably Qwen2.5) break with symmetric TurboQuant KV compression when weights are already quantized. Asymmetric K/V (e.g., -ctk q8_0 -ctv q4_0) may be a safer middle ground.
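To see what the asymmetric option would buy, the combined compression can be estimated from the per-type ratios. This sketch assumes K and V occupy the same amount of memory at f16 (true for standard attention with equal K and V head dimensions, not necessarily for other layouts). The function name is illustrative.

```python
# Approximate compression ratios vs f16, as listed earlier in this issue.
COMPRESSION_VS_F16 = {"q8_0": 1.9, "q4_0": 3.6}


def combined_compression(k_type, v_type):
    """Overall KV compression vs f16 when K and V use different cache types.

    Assumes K and V are the same size at f16, so each contributes half of
    the uncompressed footprint.
    """
    k_fraction = 1.0 / COMPRESSION_VS_F16[k_type]
    v_fraction = 1.0 / COMPRESSION_VS_F16[v_type]
    return 2.0 / (k_fraction + v_fraction)


for k_type, v_type in [("q8_0", "q8_0"), ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
    ratio = combined_compression(k_type, v_type)
    print(f"-ctk {k_type} -ctv {v_type}: ~{ratio:.2f}x vs f16")
```

Under that assumption, `-ctk q8_0 -ctv q4_0` lands at roughly 2.5x, between the two symmetric settings.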

Dedicated TBQ3_0/TBQ4_0 types are proposed in ggml-org/llama.cpp#21089 but have not been merged yet.

MLX — Not yet officially supported

  • MLX core: Open PR ml-explore/mlx#3328 for a native Metal kernel; not merged, as maintainers prefer a generic quantized SDPA approach
  • mlx-lm: Open PR ml-explore/mlx-lm#1067 for turbo_kv_bits=3 — not merged
  • Community implementations exist (e.g., arozanov/turboquant-mlx with 4.6x compression), but nothing official

On Apple Silicon at 32K context, TurboQuant achieves ~4x speedup due to nearly constant attention latency.

TODO

  • Check which llama.cpp version we're building — confirm it includes PR #21038
  • Benchmark q4_0 KV cache (with rotation) on our model set — measure perplexity and VRAM
  • Test asymmetric K/V (-ctk q8_0 -ctv q4_0) as a safer option
  • Recalibrate kvCacheMbPer1kTokens values if we switch cache types
  • Monitor mlx-lm PR #1067 for upstream merge
  • Update entrypoint and VRAM budget calculations accordingly
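For the benchmarking items, a small helper could generate the perplexity runs for each cache configuration. This is a sketch that assumes the `llama-perplexity` binary from llama.cpp is on the PATH and takes `-m`, `-f`, `-ctk`, and `-ctv`. The model and text file paths are placeholders, and depending on the build a quantized V cache may also require flash attention to be enabled.

```python
import shlex

# Cache configurations named in this issue: current, asymmetric, and q4_0.
CACHE_CONFIGS = [("q8_0", "q8_0"), ("q8_0", "q4_0"), ("q4_0", "q4_0")]


def perplexity_cmd(model_path, text_path, ctk, ctv, binary="llama-perplexity"):
    """Build the argv list for one perplexity run with the given K/V cache types."""
    return [binary, "-m", model_path, "-f", text_path, "-ctk", ctk, "-ctv", ctv]


# Placeholder paths; substitute each model in our set and a fixed eval text.
for ctk, ctv in CACHE_CONFIGS:
    print(shlex.join(perplexity_cmd("model.gguf", "eval.txt", ctk, ctv)))
```

Running the same eval text across all three configurations keeps the perplexity numbers comparable. VRAM would need to be recorded separately for each run.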
