Investigate TurboQuant for KV cache compression (llama.cpp + MLX)

## Context

[TurboQuant](https://arxiv.org/abs/2504.19874) is a Google Research paper (ICLR 2026) that applies a random orthogonal rotation (a Walsh-Hadamard Transform) to KV cache vectors before quantizing them. The rotation spreads outlier values across all coordinates, so a low-bit quantizer loses less information. The method is data-oblivious (no calibration needed) and achieves near-optimal compression with minimal quality loss.
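To make the rotation idea concrete, here is a minimal pure-Python sketch. It is an illustration only, not the paper's algorithm or the llama.cpp implementation: the helper names (`fwht`, `rotate`, `unrotate`, `quantize_absmax`) are hypothetical, the "random rotation" is assumed to be random sign flips followed by an orthonormal Walsh-Hadamard transform, and the quantizer uses a single absmax scale per vector rather than llama.cpp's 32-element blocks.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (len(x) must be a power of two)."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def rotate(x, signs):
    """Random sign flips followed by the WHT: an orthogonal transform."""
    return fwht([s * v for s, v in zip(signs, x)])


def unrotate(y, signs):
    """Inverse of rotate(): the orthonormal WHT is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]


def quantize_absmax(x, bits=4):
    """Symmetric round-to-nearest with one scale per vector, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in x)
    if amax == 0.0:
        return list(x)
    scale = amax / qmax
    return [round(v / scale) * scale for v in x]


def sq_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


rng = random.Random(0)
n = 128  # hypothetical head dimension
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

# Synthetic vector: unit-variance noise plus one large outlier channel.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
x[3] = 50.0

err_plain = sq_error(x, quantize_absmax(x))
err_rot = sq_error(x, unrotate(quantize_absmax(rotate(x, signs)), signs))
print(f"4-bit squared error without rotation: {err_plain:.2f}")
print(f"4-bit squared error with rotation:    {err_rot:.2f}")
```

Without rotation, the single outlier sets the quantization scale and most of the small values round to zero. After rotation the outlier's energy is spread over all 128 coordinates, so the same 4-bit grid covers a much narrower range.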

## llama.cpp — Core rotation already merged

The core rotation idea is **merged** in [ggml-org/llama.cpp#21038](https://github.com/ggml-org/llama.cpp/pull/21038) (April 1, 2026) and is **enabled by default** in current llama.cpp master. It improves all existing KV cache quant types automatically.

We currently use `-ctk q8_0 -ctv q8_0`. With rotation enabled, we could potentially **drop to q4_0** and still maintain near-lossless quality:

| Cache type | Compression vs f16 | PPL impact |
|---|---|---|
| q8_0 (current) | ~1.9x | negligible |
| q4_0 + rotation | ~3.6x | negligible |
| tbq3_0 (proposed, not merged) | ~5.2x | slight degradation |

Going from q8_0 → q4_0 with rotation would **roughly halve KV cache VRAM**, allowing significantly more context length on the same hardware. All `kvCacheMbPer1kTokens` values would need recalibration.
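The compression figures for q8_0 and q4_0 follow from the ggml block layouts: a q8_0 block stores 32 int8 values plus one fp16 scale (34 bytes, 8.5 bits per element), and a q4_0 block stores 32 4-bit values plus one fp16 scale (18 bytes, 4.5 bits per element). The sketch below reproduces that arithmetic and estimates a per-1k-token cache size. The function names and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical, and it assumes "1k tokens" means 1024 tokens and "MB" means MiB, which may not match how `kvCacheMbPer1kTokens` is defined in our code.

```python
# Bits per stored element, derived from the ggml block layouts.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 8.5,  # 32 x int8 + fp16 scale = 34 bytes per 32 elements
    "q4_0": 4.5,  # 32 x 4-bit + fp16 scale = 18 bytes per 32 elements
}


def compression_vs_f16(cache_type):
    """Compression ratio of a cache type relative to f16."""
    return BITS_PER_ELEMENT["f16"] / BITS_PER_ELEMENT[cache_type]


def kv_cache_mib_per_1k_tokens(n_layers, n_kv_heads, head_dim,
                               type_k, type_v, tokens=1024):
    """Estimated KV cache size in MiB for `tokens` tokens (hypothetical helper)."""
    elements_per_token = n_layers * n_kv_heads * head_dim  # per K or per V
    bits_per_token = elements_per_token * (
        BITS_PER_ELEMENT[type_k] + BITS_PER_ELEMENT[type_v]
    )
    return bits_per_token / 8 * tokens / 2 ** 20


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for type_k, type_v in [("f16", "f16"), ("q8_0", "q8_0"),
                       ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
    mib = kv_cache_mib_per_1k_tokens(type_k=type_k, type_v=type_v, **shape)
    print(f"-ctk {type_k:5s} -ctv {type_v:5s}: {mib:6.1f} MiB per 1024 tokens")
```

Because `-ctk` and `-ctv` are set independently, the helper takes the K and V types separately. Real values should still come from measurement, since this estimate ignores any allocator padding or per-backend overhead.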

**Caveat:** There is a known interaction with weight quantization. Some architectures (notably Qwen2.5) break with symmetric TurboQuant KV compression, where K and V use the same low-bit type, when the weights are already quantized. Asymmetric K/V (e.g., `-ctk q8_0 -ctv q4_0`) may be a safer middle ground.

Dedicated TBQ3_0/TBQ4_0 types are proposed in [ggml-org/llama.cpp#21089](https://github.com/ggml-org/llama.cpp/pull/21089) but have not been merged yet.

## MLX — Not yet officially supported

- MLX core: Open PR [ml-explore/mlx#3328](https://github.com/ml-explore/mlx/pull/3328) for native Metal kernel — not merged, maintainers prefer generic quantized SDPA approach
- mlx-lm: Open PR [ml-explore/mlx-lm#1067](https://github.com/ml-explore/mlx-lm/pull/1067) for `turbo_kv_bits=3` — not merged
- Community implementations exist (e.g., `arozanov/turboquant-mlx` with 4.6x compression), but nothing official

On Apple Silicon at 32K context, TurboQuant is reported to achieve a ~4x speedup because attention latency stays nearly constant.

## TODO

- [ ] Check which llama.cpp version we're building — confirm it includes PR #21038
- [ ] Benchmark q4_0 KV cache (with rotation) on our model set — measure perplexity and VRAM
- [ ] Test asymmetric K/V (`-ctk q8_0 -ctv q4_0`) as a safer option
- [ ] Recalibrate `kvCacheMbPer1kTokens` values if we switch cache types
- [ ] Monitor mlx-lm PR #1067 for upstream merge
- [ ] Update entrypoint and VRAM budget calculations accordingly
