
Add challenge 95: Decode-Phase Attention (Medium) #248

Open
claude[bot] wants to merge 1 commit into main from add-challenge-95-decode-phase-attention

Conversation


claude[bot] (Contributor) commented Apr 16, 2026

Summary

  • Adds challenge 95: Decode-Phase Attention (Medium difficulty)
  • Models the single-token-query attention used during autoregressive LLM inference decode steps: Q has shape (batch_size, num_q_heads, head_dim) — no sequence dimension — while K and V are the full KV cache (batch_size, num_kv_heads, cache_len, head_dim)
  • Supports Grouped Query Attention (GQA): each KV head is shared by num_q_heads / num_kv_heads query heads
  • Performance test: LLaMA-3 8B-style config — batch_size=4, num_q_heads=32, num_kv_heads=8, cache_len=16,384, head_dim=128
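As a rough reference (not the repository's starter code), the decode-step attention described above can be sketched in NumPy. The shapes and the GQA head-sharing follow the bullets; the function name and the use of `np.repeat` to expand KV heads are illustrative choices, not taken from the PR:

```python
import numpy as np

def decode_attention(q, k, v):
    """Single-token-query attention over a full KV cache with GQA.

    q:    (batch, num_q_heads, head_dim)            -- one decode-step query
    k, v: (batch, num_kv_heads, cache_len, head_dim) -- the KV cache
    num_q_heads must be a multiple of num_kv_heads.
    """
    batch, num_q_heads, head_dim = q.shape
    _, num_kv_heads, cache_len, _ = k.shape
    group = num_q_heads // num_kv_heads  # query heads per KV head

    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=1)  # (batch, num_q_heads, cache_len, head_dim)
    v = np.repeat(v, group, axis=1)

    # Attention scores over the cache: (batch, num_q_heads, cache_len)
    scores = np.einsum("bhd,bhld->bhl", q, k) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Weighted sum of cached values: (batch, num_q_heads, head_dim)
    return np.einsum("bhl,bhld->bhd", probs, v)
```

Note there is no causal mask: at decode time the query token attends to the entire cache, which is exactly why the kernel reduces over `cache_len`.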

Why this is interesting

This challenge teaches a key GPU programming concept: the same attention formula requires a completely different implementation strategy at decode time vs. training time. Training attention (e.g., GQA challenge #80, Flash Attention PR #232) is compute-bound with equal-length Q and KV; decode-phase attention is memory-bandwidth-bound with a single-token query streaming over the entire KV cache. Efficient decode kernels parallelize over batch/heads and reduce over cache_len, a pattern not covered by any existing challenge or open PR.
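The memory-bandwidth-bound claim can be checked with a back-of-envelope calculation for the performance-test config. The fp16 KV cache (2 bytes/element) and the single-pass-over-HBM assumption are mine, not stated in the PR:

```python
# LLaMA-3 8B-style decode config from the performance test.
batch, num_q_heads, num_kv_heads = 4, 32, 8
cache_len, head_dim = 16_384, 128
dtype_bytes = 2  # assumed fp16 KV cache

# QK^T dot product + PV weighted sum: ~4*head_dim FLOPs per (query head, cache position).
flops = batch * num_q_heads * cache_len * 4 * head_dim

# K and V each read once from HBM; Q and the output are negligible by comparison.
kv_bytes = 2 * batch * num_kv_heads * cache_len * head_dim * dtype_bytes

print(flops / kv_bytes)  # 4.0 FLOPs/byte
```

An arithmetic intensity of ~4 FLOPs/byte is far below the compute-to-bandwidth ratio of any modern GPU, so the kernel's runtime is set by how fast it can stream the KV cache, not by its math throughput.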

Test plan

  • All 6 starter files present (starter.cu, starter.pytorch.py, starter.triton.py, starter.jax.py, starter.cute.py, starter.mojo)
  • 10 functional test cases: edge cases (cache_len=1,2), zero inputs, MQA (kv_heads=1), GQA groups=2, MHA (kv_heads=q_heads), power-of-2 and non-power-of-2 cache lengths, realistic LLaMA-3 config
  • pre-commit run --all-files passes (black, isort, flake8, clang-format, mojo format)
  • Validated on NVIDIA Tesla T4 via run_challenge.py --action submit → "✓ All tests passed"
  • Checklist in CLAUDE.md verified

🤖 Generated with Claude Code

Single-token-query attention over a full KV cache, the dominant kernel in autoregressive LLM decode steps. Supports Grouped Query Attention (GQA), where multiple query heads share one KV head. Teaches the memory-bandwidth-bound nature of decode-phase workloads, distinct from compute-bound training attention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
