
sparse V: skip negligible attention weights across all backends #98

Open

TheTom wants to merge 1 commit into feature/turboquant-kv-cache from feature/sparse-v-metal

Conversation


@TheTom (Owner) commented Apr 21, 2026

Summary

Zero or skip V accumulation for positions where the softmax attention weight falls below 1e-6. At long context, more than 90% of the weights are negligible; removing them avoids accumulating quantization noise with no measurable quality impact (PPL identical, NIAH improved).
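The core of the change, in scalar form, is a single threshold test before the V accumulation. A minimal sketch in plain C++ (names like `SPARSE_V_EPS` and `accumulate_v_sparse` are hypothetical, not the actual kernel code):

```cpp
#include <cstddef>

constexpr float SPARSE_V_EPS = 1e-6f;  // threshold from the description above

// out[d] += sum_i p[i] * V[i][d], skipping positions whose weight is negligible.
void accumulate_v_sparse(const float *p,   // softmax weights, length n_kv
                         const float *v,   // V rows, n_kv x head_dim, row-major
                         float *out,       // accumulator, length head_dim
                         size_t n_kv, size_t head_dim) {
    for (size_t i = 0; i < n_kv; ++i) {
        if (p[i] < SPARSE_V_EPS) {
            continue;  // skip the V load/dequant and the multiply-add entirely
        }
        for (size_t d = 0; d < head_dim; ++d) {
            out[d] += p[i] * v[i * head_dim + d];
        }
    }
}
```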

Changes

  • Metal VEC: skip V dequant entirely via continue (gated by the TURBO_SPARSE_V preprocessor define)
  • CUDA tile: zero the KQ entry before the V matmul (see the sketch below)
  • Vulkan: zero Pf before V accumulation (flash_attn.comp + flash_attn_cm1.comp)
  • CUDA VEC: already present (signalnine), unchanged

4 files, +3 net lines.
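For the CUDA tile and Vulkan paths the control flow is not a per-position skip; negligible weights are zeroed before the V matmul so the dense accumulation code stays unchanged. A rough sketch of that variant, written here with the same TURBO_SPARSE_V gate the Metal path uses (whether the other backends share that define is an assumption, and the function name is hypothetical):

```cpp
// Hypothetical sketch of the "zero the weight" variant; kq stands in for the
// softmaxed KQ tile (Pf in the Vulkan shaders).
#ifndef TURBO_SPARSE_V_EPS
#define TURBO_SPARSE_V_EPS 1e-6f
#endif

static inline void zero_negligible_weights(float *kq, int n) {
#ifdef TURBO_SPARSE_V
    for (int i = 0; i < n; ++i) {
        if (kq[i] < TURBO_SPARSE_V_EPS) {
            kq[i] = 0.0f;  // contributes nothing in the subsequent V matmul
        }
    }
#else
    (void)kq; (void)n;     // sparse V disabled: leave the weights untouched
#endif
}
```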

Test Results (M5 Max, Nemotron 30B-A3B, turbo3 KV, r=3)

| Context | No sparse V (tok/s) | With sparse V (tok/s) | Delta |
| --- | --- | --- | --- |
| short | 82.72 ± 0.43 | 82.17 ± 0.87 | noise |
| 8K | 12.98 ± 0.37 | 12.94 ± 0.28 | noise |
| 16K | 7.54 ± 0.27 | 8.24 ± 0.12 | +9.3% |

PPL: 12.5942 (identical with and without sparse V, 10 chunks of wikitext-2)
