# TriAttention V3: Hybrid Model Recipe

**James Tervit, Chronara Group**

Response to the open question in triattention-v3.md Section 5:

> "On the Qwen3.5 hybrid Mamba+Attention architectures, perplexity transfers cleanly but needle retrieval silently fails at middle and end positions even under V3. The final section requests community input on the missing pieces."

This document proposes two targeted fixes and a validation recipe.

---

## Diagnosis: Why Hybrid Models Fail

The NIAH failure on Qwen3.5-27B is not a bug in V3; it is a calibration problem. V3 applies the same eviction pressure (10%) to a model where each KV token carries 4x more retrieval weight.

### The math

| Model | Attention layers | Total layers | Attention fraction | KV per-token criticality |
|-------|-----------------|-------------|-------------------|-------------------------|
| Qwen2.5-7B | 32 | 32 | 100% | 1x (baseline) |
| Qwen3.5-27B | 16 | 64 | 25% | **4x** |
| Qwen3.5-35B-A3B | 10 | 40 | 25% | **4x** |

On Qwen2.5-7B, evicting 10% of tokens removes 10% of the model's ability to attend to that information. The other 90% of attention layers still see the remaining tokens, so redundancy is high.

On Qwen3.5-27B, evicting 10% of tokens removes 10% of the information from only 16 attention layers. But those 16 layers are the **only** mechanism the model has for position-dependent retrieval (Mamba layers are position-agnostic by design). Each evicted token leaves a 4x larger hole in the attention fabric.

This explains every observation in the V3 paper:
- **PPL is fine** because perplexity is a next-token metric dominated by the Mamba layers, which are not affected by KV eviction
- **NIAH fails at middle/end** because those positions depend on long-range retrieval through the sparse attention layers
- **Start position passes** because the prefix protection (128 tokens) saves the needle
- **Boundary skip doesn't help** because the problem is not which layers contribute to scoring; it is that too many tokens are being evicted from too few attention layers

---

## Fix 1: Scale Budget by Attention Fraction

The eviction budget should be proportional to the model's attention density, not a fixed percentage.
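As a quick sanity check on that scaling, here is a minimal standalone sketch of the arithmetic (plain C; the layer counts come from the diagnosis table above, and the variable names mirror the formula in the next subsection):

```c
#include <stdio.h>

int main(void) {
    // Qwen3.5-27B layer counts, from the diagnosis table above.
    const int n_attn  = 16;   // attention layers
    const int n_layer = 64;   // total layers

    const float raw_budget         = 0.90f;  // standard V3 budget (10% eviction)
    const float attention_fraction = (float) n_attn / (float) n_layer;         // 0.25
    const float eviction_rate      = (1.0f - raw_budget) * attention_fraction; // 0.025
    const float effective_budget   = 1.0f - eviction_rate;                     // 0.975

    printf("attention_fraction = %.2f\n", attention_fraction);
    printf("effective_budget   = %.3f (%.1f%% eviction)\n",
           effective_budget, 100.0f * eviction_rate);
    return 0;
}
```

The same arithmetic reproduces every row of the concrete-values table below.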
### Formula

```
attention_fraction = n_attention_layers / n_total_layers
eviction_rate = (1.0 - raw_budget) * attention_fraction
effective_budget = 1.0 - eviction_rate
```

### Concrete values

| Model | raw_budget | attention_fraction | effective_budget | Tokens evicted |
|-------|-----------|-------------------|-----------------|----------------|
| Qwen2.5-7B | 0.90 | 1.00 | **0.90** (10% eviction) | unchanged |
| Qwen3.5-27B | 0.90 | 0.25 | **0.975** (2.5% eviction) | 4x fewer |
| Qwen3.5-35B-A3B | 0.90 | 0.25 | **0.975** (2.5% eviction) | 4x fewer |

### Implementation sketch (in llama-triattention.cpp)

```c
// In triattention_evict(), before computing n_to_evict:

float attention_fraction = 1.0f;
if (model->hparams.ssm_d_state > 0) {
    // Hybrid model: count attention layers vs. total layers
    int n_attn = 0;
    for (int il = 0; il < (int) model->hparams.n_layer; il++) {
        if (is_attention_layer(model, il)) n_attn++;
    }
    attention_fraction = (float) n_attn / (float) model->hparams.n_layer;
}

float effective_evict_rate = evict_rate * attention_fraction;
int n_to_evict = (int)(n_candidates * effective_evict_rate);
```

The `is_attention_layer()` check can use `full_attention_interval` from the model config: on Qwen3.5, attention layers sit at indices where `il % full_attention_interval == 0`.

### Why 2.5% eviction is still useful

At 2.5% eviction × 4.6x TurboQuant compression:
- 32K context: saves ~800 tokens of KV on top of 4.6x compression on the rest
- On reasoning workloads, the redundant thinking-trace tokens score lowest and are preferentially evicted, so the 2.5% that gets evicted is the 2.5% that matters least
- Combined with TQBridge: even 2.5% fewer tokens per transfer adds up across 32 layers × thousands of tokens

---

## Fix 2: Partial RoPE Frequency Count

Qwen3.5 uses partial RoPE: only `n_rot` dimensions (64 out of 256) carry rotary position embeddings. The remaining 192 dimensions have no position encoding.

The trig scoring formula computes a phase-alignment score across frequency bins:

```
score = sum_f (A_f * cos_sum_f - B_f * sin_sum_f)
```

where `f` ranges over `head_dim / 2 = 128` frequency bins.

**The problem**: 96 of those 128 bins contribute zero signal because the corresponding dimensions have no RoPE rotation. The score averages 32 bins of real signal with 96 bins of noise, reducing the signal-to-noise ratio by roughly 4x.

### Fix

```c
// Current (in triattention scoring loop):
int freq_count = head_dim / 2;      // = 128

// Fixed:
int n_rot = model->hparams.n_rot;   // = 64 for Qwen3.5, = head_dim for standard
int freq_count = n_rot / 2;         // = 32 for Qwen3.5, = 64 for standard
```

This makes the scoring roughly 4x less noisy on Qwen3.5 without affecting standard transformers, where `n_rot == head_dim`.

---

## Validation Recipe

Run these tests in order. Each builds on the previous result.
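For reference while validating, here is a minimal sketch of what the corrected scoring loop computes once Fix 2 is in place (standalone C; the function name and the flat per-frequency accumulator layout are assumptions for illustration, not the actual llama-triattention.cpp code):

```c
// Sketch of the per-token trig score with Fix 2 applied. A, B, cos_sum and
// sin_sum are the per-frequency accumulators from the formula in Fix 2;
// the flat-array layout and the function name are hypothetical.
float triatt_token_score(const float * A, const float * B,
                         const float * cos_sum, const float * sin_sum,
                         int n_rot) {
    // Fix 2: sum only over bins that carry a real RoPE rotation.
    // On standard transformers n_rot == head_dim, so nothing changes there.
    const int freq_count = n_rot / 2;   // 32 on Qwen3.5, 64 on Qwen2.5-7B
    float score = 0.0f;
    for (int f = 0; f < freq_count; f++) {
        score += A[f] * cos_sum[f] - B[f] * sin_sum[f];
    }
    return score;
}
```

Steps 1 through 4 below then exercise the two fixes in isolation and combined.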
+ +### Step 1: Apply Fix 1 only (budget scaling) + +```bash +# Qwen3.5-27B, 32K context, V3 with scaled budget +# The effective budget at attention_fraction=0.25 should be ~0.975 + +./build-test/bin/llama-perplexity \ + -m Qwen3.5-27B-Q8_0.gguf \ + -f wikitext-2-raw/wiki.test.raw \ + -b 512 --chunks 3 -c 32768 \ + --triatt-budget 31457 # 32768 * 0.96 ≈ 31457 + +# Expected: PPL ≈ 7.47 (same as current V3) +``` + +Then NIAH: +```bash +# Middle position (65000 chars) — the failing case +./build-test/bin/llama-completion \ + -m Qwen3.5-27B-Q8_0.gguf \ + -f niah_prompt_mid.txt \ + -n 1024 -c 32768 --temp 0 -no-cnv --no-display-prompt \ + --triatt-budget 31457 --triatt-hybrid 2 --triatt-prefix 128 + +# Expected: PASS (or at least PARTIAL instead of FAIL) +``` + +### Step 2: Apply Fix 2 only (partial RoPE) + +Keep the original 90% budget but fix the frequency count. This tests whether the noise reduction alone recovers NIAH. + +### Step 3: Apply both fixes + +The expected best result. Both fixes are orthogonal — budget scaling reduces eviction pressure, frequency fix improves eviction quality. + +### Step 4: Stack with TurboQuant+ + +```bash +# Full stack: TQ+ (q8_0 K + turbo3 V) + V3 (scaled budget + partial RoPE fix) +./build-test/bin/llama-perplexity \ + -m Qwen3.5-27B-Q8_0.gguf \ + -f wikitext-2-raw/wiki.test.raw \ + -b 512 --chunks 3 -c 32768 \ + -ctk q8_0 -ctv turbo3 \ + --triatt-budget 31457 --triatt-hybrid 2 --triatt-prefix 128 + +# Expected: PPL within +1% of f16 baseline +``` + +--- + +## For TQBridge Integration + +If both fixes work, the combined compression for distributed inference: + +| Workload | TurboQuant | TriAttention V3 | Combined | Per-token over wire | +|----------|-----------|-----------------|----------|---------------------| +| Standard 7B, 32K | 4.6x | 1.11x (90%) | 5.1x | ~10KB | +| Hybrid 27B, 32K | 4.6x | 1.03x (97.5%) | 4.7x | ~11KB | +| Reasoning 7B, 32K | 4.6x | ~5x (est.) | ~23x | ~2.2KB | + +The reasoning case is where TQBridge + TriAttention V3 delivers the most value — thinking traces generate thousands of redundant tokens that TriAttention evicts before TQBridge compresses and transfers. At 2.2KB per token, a 27B model's KV cache transfers comfortably over WiFi. + +--- + +## Summary + +Two fixes, both derived from the model architecture rather than tuning: + +1. **Scale eviction by attention fraction** — fewer attention layers means each token is more critical. Don't evict 10% when each token does 4x the work. + +2. **Fix frequency count for partial RoPE** — don't average 32 bins of signal with 96 bins of noise. + +Neither fix requires changes to the scoring formula itself. V3's trig scoring is correct — it just needs the right inputs (frequency count) and the right budget (scaled to attention density).