
sparse V: skip negligible attention weights across all backends #98

Open

TheTom wants to merge 1 commit into feature/turboquant-kv-cache from feature/sparse-v-metal

Conversation


@TheTom (Owner) commented Apr 21, 2026

Summary

Zero or skip V accumulation for positions where the softmax attention weight falls below 1e-6. At long context, more than 90% of the weights are negligible; removing them avoids accumulating quantization noise with no measurable quality impact (PPL identical, NIAH improved).
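The core of the change, in scalar form, is a single threshold test before the V accumulation. A minimal sketch in plain C++ (names like `SPARSE_V_EPS` and `accumulate_v_sparse` are hypothetical, not the actual kernel code):

```cpp
#include <cstddef>

constexpr float SPARSE_V_EPS = 1e-6f;  // threshold from the description above

// out[d] += sum_i p[i] * V[i][d], skipping positions whose weight is negligible.
void accumulate_v_sparse(const float *p,   // softmax weights, length n_kv
                         const float *v,   // V rows, n_kv x head_dim, row-major
                         float *out,       // accumulator, length head_dim
                         size_t n_kv, size_t head_dim) {
    for (size_t i = 0; i < n_kv; ++i) {
        if (p[i] < SPARSE_V_EPS) {
            continue;  // skip the V load/dequant and the multiply-add entirely
        }
        for (size_t d = 0; d < head_dim; ++d) {
            out[d] += p[i] * v[i * head_dim + d];
        }
    }
}
```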

Changes

  • Metal VEC: skip V dequant entirely via continue (gated by the TURBO_SPARSE_V preprocessor define)
  • CUDA tile: zero the KQ entry before the V matmul (see the sketch below)
  • Vulkan: zero Pf before V accumulation (flash_attn.comp + flash_attn_cm1.comp)
  • CUDA VEC: already present (signalnine), unchanged

4 files, +3 net lines.
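For the CUDA tile and Vulkan paths the control flow is not a per-position skip; negligible weights are zeroed before the V matmul so the dense accumulation code stays unchanged. A rough sketch of that variant, written here with the same TURBO_SPARSE_V gate the Metal path uses (whether the other backends share that define is an assumption, and the function name is hypothetical):

```cpp
// Hypothetical sketch of the "zero the weight" variant; kq stands in for the
// softmaxed KQ tile (Pf in the Vulkan shaders).
#ifndef TURBO_SPARSE_V_EPS
#define TURBO_SPARSE_V_EPS 1e-6f
#endif

static inline void zero_negligible_weights(float *kq, int n) {
#ifdef TURBO_SPARSE_V
    for (int i = 0; i < n; ++i) {
        if (kq[i] < TURBO_SPARSE_V_EPS) {
            kq[i] = 0.0f;  // contributes nothing in the subsequent V matmul
        }
    }
#else
    (void)kq; (void)n;     // sparse V disabled: leave the weights untouched
#endif
}
```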

Test Results (M5 Max, Nemotron 30B-A3B, turbo3 KV, r=3)

| Context | No sparse V (tok/s) | With sparse V (tok/s) | Delta |
| --- | --- | --- | --- |
| short | 82.72 ± 0.43 | 82.17 ± 0.87 | noise |
| 8K | 12.98 ± 0.37 | 12.94 ± 0.28 | noise |
| 16K | 7.54 ± 0.27 | 8.24 ± 0.12 | +9.3% |

PPL: 12.5942 (identical with and without sparse V, 10 chunks of wikitext-2)
