Context
TurboQuant is a Google Research paper (ICLR 2026) that applies random orthogonal rotation (Walsh-Hadamard Transform) before quantizing KV cache vectors. It's data-oblivious (no calibration needed) and achieves near-optimal compression with minimal quality loss.
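To illustrate why rotating before quantizing helps, here is a minimal Python sketch. It is not the TurboQuant algorithm or the llama.cpp implementation: it assumes a randomized Walsh-Hadamard rotation (random sign flips followed by an orthonormal WHT) and swaps in a plain absmax round-to-nearest 4-bit quantizer applied to the whole vector. All function names are hypothetical.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def rotate(x, signs):
    """Random sign flips followed by the WHT: an orthogonal, data-oblivious rotation."""
    return fwht([s * v for s, v in zip(signs, x)])


def unrotate(y, signs):
    """The orthonormal WHT is its own inverse; afterwards undo the sign flips."""
    return [s * v for s, v in zip(signs, fwht(y))]


def quantize_dequantize(x, bits=4):
    """Symmetric absmax round-to-nearest quantizer (illustrative stand-in only)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in x)
    if amax == 0:
        return list(x)
    scale = amax / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
n = 128
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
x[5] = 40.0  # one large outlier channel, chosen for illustration
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

plain = quantize_dequantize(x)
rotated = unrotate(quantize_dequantize(rotate(x, signs)), signs)

err_plain = mse(x, plain)
err_rot = mse(x, rotated)
print(f"MSE without rotation: {err_plain:.4f}")
print(f"MSE with rotation:    {err_rot:.4f}")
```

Without rotation, the outlier sets the quantization scale and most other values collapse to zero. The rotation spreads the outlier's energy across all coordinates, so the same 4-bit grid covers a much narrower range. No calibration data is involved, which is what "data-oblivious" refers to.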
llama.cpp — Core rotation already merged
The core rotation idea is merged in ggml-org/llama.cpp#21038 (April 1, 2026) and is enabled by default in current llama.cpp master. It improves all existing KV cache quant types automatically.
We currently use -ctk q8_0 -ctv q8_0. With rotation enabled, we could potentially drop to q4_0 and still maintain near-lossless quality:
| Cache type | Compression vs f16 | PPL impact |
|---|---|---|
| q8_0 (current) | ~1.9x | negligible |
| q4_0 + rotation | ~3.6x | negligible |
| tbq3_0 (proposed, not merged) | ~5.2x | slight degradation |
Going from q8_0 → q4_0 with rotation would roughly halve KV cache VRAM (8.5 → 4.5 bits per element), allowing close to twice the context length within the same KV cache budget. All kvCacheMbPer1kTokens values would need recalibration.
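The compression ratios and the recalibration can be estimated with a short Python sketch. It assumes the standard ggml block layouts (32 elements plus one fp16 scale per block, giving 8.5 bits per element for q8_0 and 4.5 for q4_0) and that kvCacheMbPer1kTokens means MiB of K plus V cache per 1,000 tokens. The function names and the model shape in the example are hypothetical, and other buffers are ignored.

```python
# Bits per element, including the per-block fp16 scale (blocks of 32 elements).
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 8.5,  # 32 x int8 + one fp16 scale = 34 bytes per block
    "q4_0": 4.5,  # 32 x 4-bit + one fp16 scale = 18 bytes per block
}


def compression_vs_f16(cache_type):
    """Storage compression ratio of a cache type relative to f16."""
    return BITS_PER_ELEMENT["f16"] / BITS_PER_ELEMENT[cache_type]


def kv_cache_mb_per_1k_tokens(n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Estimated MiB of K+V cache per 1,000 tokens for a given model shape."""
    elems_per_token = n_layers * n_kv_heads * head_dim  # for K or V alone
    bits_per_token = elems_per_token * (
        BITS_PER_ELEMENT[type_k] + BITS_PER_ELEMENT[type_v]
    )
    return bits_per_token / 8 * 1000 / (1024 * 1024)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
    mb = kv_cache_mb_per_1k_tokens(type_k=k, type_v=v, **shape)
    print(f"K={k:5s} V={v:5s} -> {mb:7.2f} MiB per 1k tokens")
```

The estimate only covers storage; measured values from a real run should take precedence when updating the config.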
Caveat: Known interaction with weight quantization — some architectures (notably Qwen2.5) break with symmetric TurboQuant KV compression when weights are already quantized. Asymmetric K/V (e.g., -ctk q8_0 -ctv q4_0) may be a safer middle ground.
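A small Python helper can keep the K and V cache types separately configurable, so the asymmetric setup is easy to try. This is only a sketch: the helper name, the allowed-type list (restricted to the types discussed here) and the model path are hypothetical, and only the -ctk / -ctv flags come from our current invocation.

```python
# Cache types considered in this note; llama.cpp itself accepts more.
ALLOWED_CACHE_TYPES = ("f16", "q8_0", "q4_0")


def kv_cache_args(type_k="q8_0", type_v="q4_0"):
    """Build the -ctk / -ctv arguments, defaulting to the asymmetric middle ground."""
    for cache_type in (type_k, type_v):
        if cache_type not in ALLOWED_CACHE_TYPES:
            raise ValueError(f"unsupported cache type for this setup: {cache_type}")
    return ["-ctk", type_k, "-ctv", type_v]


# Hypothetical model path; pass the list to subprocess.run() to launch.
cmd = ["llama-server", "-m", "model.gguf"] + kv_cache_args()
print(" ".join(cmd))
```

Rejecting unmerged types such as tbq3_0 up front avoids passing a flag value that current llama.cpp master does not know about.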
Dedicated TBQ3_0/TBQ4_0 types are proposed in ggml-org/llama.cpp#21089 but not merged yet.
MLX — Not yet officially supported
- MLX core: Open PR ml-explore/mlx#3328 for a native Metal kernel; not merged, maintainers prefer a generic quantized SDPA approach
- mlx-lm: Open PR ml-explore/mlx-lm#1067 for turbo_kv_bits=3; not merged
- Community implementations exist (e.g., arozanov/turboquant-mlx with 4.6x compression), but nothing official
On Apple Silicon at 32K context, TurboQuant achieves ~4x speedup due to nearly constant attention latency.
TODO
- Asymmetric K/V (-ctk q8_0 -ctv q4_0) as a safer option
- Recalibrate kvCacheMbPer1kTokens values if we switch cache types