What is this?
TurboQuant+ is an implementation of Google's TurboQuant paper (ICLR 2026, arXiv 2504.19874) — a KV cache compression algorithm that achieves up to 4.9× memory reduction for transformer inference.
I built this to solve a specific problem: running large language models locally on Apple Silicon with long context windows. The KV cache is the main memory bottleneck — at 128K context on a 27B model, the cache alone eats gigabytes and triggers expensive reprocessing when it overflows.
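For a rough sense of scale, here is a back-of-the-envelope estimate in Python. The layer count, KV-head count, and head dimension are placeholder values for a typical ~27B dense transformer, not the exact architecture of any particular model:

```python
# Back-of-the-envelope KV cache size. All model dimensions here are
# illustrative placeholders for a ~27B dense transformer, not exact specs.
n_layers   = 46        # assumed layer count
n_kv_heads = 8         # assumed grouped-query KV heads
head_dim   = 128       # assumed per-head dimension
ctx_len    = 131_072   # 128K context
bytes_fp16 = 2

# K and V per token, per layer, stored in fp16
kv_bytes_per_token = 2 * n_kv_heads * head_dim * bytes_fp16 * n_layers
cache_gib = kv_bytes_per_token * ctx_len / 2**30
print(f"fp16 KV cache at 128K: {cache_gib:.1f} GiB")        # ~23 GiB with these numbers
print(f"at 4.9x compression:   {cache_gib / 4.9:.1f} GiB")  # ~4.7 GiB
```

With those placeholder dimensions, compression is the difference between the cache fitting comfortably in unified memory and it crowding out the weights.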
Current state
v1 is working end-to-end. The Python prototype validates the math (141 tests, 100% coverage), and the llama.cpp C port with Metal kernels runs real inference on an M5 Max. Both Qwen 3.5 35B-A3B (MoE) and Qwopus v2 27B (dense) generate coherent text with --cache-type-k turbo3.
The compression target is met (4.9×). The main open issue is a 13-35× speed regression in the Metal shader due to an unoptimized rotation matrix multiply — this is being actively worked on.
What I'm looking for
Metal shader optimization help — the dequantize kernel does a full 128×128 matvec per chunk instead of per block. Someone with Metal compute shader experience could probably 10× this.
CUDA port — the C quantize/dequantize code is ready, just needs CUDA kernels.
Benchmark contributions — perplexity evaluation, NIAH testing on different models/context lengths.
Feedback on the approach — is there a better way to handle the rotation matrix in the GPU kernel? See the rotate-then-quantize sketch after this list.
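To make that last question concrete, here is a minimal NumPy sketch of the rotate-then-quantize round trip as I understand it. The shared random orthogonal rotation, the 3-bit uniform quantizer, and the per-vector scale are illustrative assumptions, not the exact scheme from the paper or from this repo's kernels:

```python
import numpy as np

# Illustrative rotate-then-quantize round trip for one 128-dim key vector.
# The shared random orthogonal rotation, 3-bit uniform quantizer, and
# per-vector scale are assumptions for the sketch, not the repo's exact scheme.
rng = np.random.default_rng(0)
D = 128                                              # head dimension / rotation block size
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))     # random orthogonal rotation

def quantize(x, bits=3):
    r = Q @ x                                        # rotate: spreads outliers across dims
    scale = np.abs(r).max() / (2**(bits - 1) - 1)
    q = np.round(r / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # This 128x128 matvec is the hotspot called out above: it only has to be
    # applied once per 128-element rotation block on the dequantize path.
    return Q.T @ (q.astype(np.float32) * scale)

x = rng.standard_normal(D).astype(np.float32)
q, s = quantize(x)
print("reconstruction error:", np.abs(dequantize(q, s) - x).max())
```

The sketch is mainly to pin down what "handle the rotation matrix" means here: whether the inverse rotation should be materialized as a dense matvec at all, or folded into the kernel some other way, is exactly the open question.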
The "Plus"
The base TurboQuant paper is the starting point. I have ideas for extensions:
Adaptive bit allocation per layer (early layers more diffuse, later more sparse)
Temporal decay compression (4-bit recent, 2-bit older, 1-bit very old context); see the sketch after this list
Expert-aware MoE compression (97% idle experts compressed more aggressively)
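To sketch what the temporal-decay idea could look like, here is a toy bit-allocation policy in Python. The age thresholds are arbitrary placeholders, only there to show the bookkeeping:

```python
# Illustrative bit-allocation policy for the temporal-decay idea.
# The age thresholds and tiers are arbitrary placeholders, not tuned values.
def bits_for_token(age: int) -> int:
    """Pick a KV-cache bit width from how many tokens ago this entry was written."""
    if age < 4096:
        return 4    # recent context: keep near-lossless
    if age < 32768:
        return 2    # older context: coarser
    return 1        # very old context: coarsest

def average_bits(ctx_len: int) -> float:
    """Mean bits per cached element across the whole context window."""
    return sum(bits_for_token(ctx_len - 1 - pos) for pos in range(ctx_len)) / ctx_len

print(f"average bits/element at 128K: {average_bits(131_072):.2f}")  # ~1.31 with these thresholds
```

With those placeholder thresholds the cache averages roughly 1.3 bits per element at 128K, versus 16 bits for fp16, so most of the win comes from how aggressively the old tail is squeezed.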
If any of these interest you, let's talk.