What is this?
TurboQuant+ is an implementation of Google's TurboQuant paper (ICLR 2026, arXiv 2504.19874) — a KV cache compression algorithm that achieves up to 4.9× memory reduction for transformer inference.
I built this to solve a specific problem: running large language models locally on Apple Silicon with long context windows. The KV cache is the main memory bottleneck — at 128K context on a 27B model, the cache alone eats gigabytes and triggers expensive reprocessing when it overflows.
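For a rough sense of scale, here is a back-of-the-envelope estimate in Python. The layer count, KV-head count, and head dimension are placeholder values for a typical ~27B dense transformer, not the exact architecture of any particular model:

```python
# Back-of-the-envelope KV cache size. All model dimensions here are
# illustrative placeholders for a ~27B dense transformer, not exact specs.
n_layers   = 46        # assumed layer count
n_kv_heads = 8         # assumed grouped-query KV heads
head_dim   = 128       # assumed per-head dimension
ctx_len    = 131_072   # 128K context
bytes_fp16 = 2

# K and V per token, per layer, stored in fp16
kv_bytes_per_token = 2 * n_kv_heads * head_dim * bytes_fp16 * n_layers
cache_gib = kv_bytes_per_token * ctx_len / 2**30
print(f"fp16 KV cache at 128K: {cache_gib:.1f} GiB")        # ~23 GiB with these numbers
print(f"at 4.9x compression:   {cache_gib / 4.9:.1f} GiB")  # ~4.7 GiB
```

With those placeholder dimensions, compression is the difference between the cache fitting comfortably in unified memory and it crowding out the weights.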
Current state
v1 is working end-to-end. The Python prototype validates the math (141 tests, 100% coverage), and the llama.cpp C port with Metal kernels runs real inference on an M5 Max. Both Qwen 3.5 35B-A3B (MoE) and Qwopus v2 27B (dense) generate coherent text with --cache-type-k turbo3.
The compression target is met (4.9×). The main open issue is a 13-35× speed regression in the Metal shader due to an unoptimized rotation matrix multiply — this is being actively worked on.
What I'm looking for
Metal shader optimization help — the dequantize kernel does a full 128×128 matvec per chunk instead of per block. Someone with Metal compute shader experience could probably 10× this.
CUDA port — the C quantize/dequantize code is ready, just needs CUDA kernels.
Benchmark contributions — perplexity evaluation, NIAH testing on different models/context lengths.
Feedback on the approach — is there a better way to handle the rotation matrix in the GPU kernel? See the rotate-then-quantize sketch after this list.
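To make that last question concrete, here is a minimal NumPy sketch of the rotate-then-quantize round trip as I understand it. The shared random orthogonal rotation, the 3-bit uniform quantizer, and the per-vector scale are illustrative assumptions, not the exact scheme from the paper or from this repo's kernels:

```python
import numpy as np

# Illustrative rotate-then-quantize round trip for one 128-dim key vector.
# The shared random orthogonal rotation, 3-bit uniform quantizer, and
# per-vector scale are assumptions for the sketch, not the repo's exact scheme.
rng = np.random.default_rng(0)
D = 128                                              # head dimension / rotation block size
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))     # random orthogonal rotation

def quantize(x, bits=3):
    r = Q @ x                                        # rotate: spreads outliers across dims
    scale = np.abs(r).max() / (2**(bits - 1) - 1)
    q = np.round(r / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # This 128x128 matvec is the hotspot called out above: it only has to be
    # applied once per 128-element rotation block on the dequantize path.
    return Q.T @ (q.astype(np.float32) * scale)

x = rng.standard_normal(D).astype(np.float32)
q, s = quantize(x)
print("reconstruction error:", np.abs(dequantize(q, s) - x).max())
```

The sketch is mainly to pin down what "handle the rotation matrix" means here: whether the inverse rotation should be materialized as a dense matvec at all, or folded into the kernel some other way, is exactly the open question.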
The "Plus"
The base TurboQuant paper is the starting point. I have ideas for extensions:
Adaptive bit allocation per layer (early layers more diffuse, later more sparse)
Temporal decay compression (4-bit recent, 2-bit older, 1-bit very old context); see the sketch after this list
Expert-aware MoE compression (97% idle experts compressed more aggressively)
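To sketch what the temporal-decay idea could look like, here is a toy bit-allocation policy in Python. The age thresholds are arbitrary placeholders, only there to show the bookkeeping:

```python
# Illustrative bit-allocation policy for the temporal-decay idea.
# The age thresholds and tiers are arbitrary placeholders, not tuned values.
def bits_for_token(age: int) -> int:
    """Pick a KV-cache bit width from how many tokens ago this entry was written."""
    if age < 4096:
        return 4    # recent context: keep near-lossless
    if age < 32768:
        return 2    # older context: coarser
    return 1        # very old context: coarsest

def average_bits(ctx_len: int) -> float:
    """Mean bits per cached element across the whole context window."""
    return sum(bits_for_token(ctx_len - 1 - pos) for pos in range(ctx_len)) / ctx_len

print(f"average bits/element at 128K: {average_bits(131_072):.2f}")  # ~1.31 with these thresholds
```

With those placeholder thresholds the cache averages roughly 1.3 bits per element at 128K, versus 16 bits for fp16, so most of the win comes from how aggressively the old tail is squeezed.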
If any of these interest you, let's talk.