
[FEATURE SUPPORT] Add Triton decode support with KV-cache APIs#271

Merged
LoserCheems merged 59 commits into main from support-triton-decode on Apr 26, 2026

Conversation

@LoserCheems
Collaborator

Summary

  • Add Triton-based decode support for flash sparse attention across dense, sparse, and gated attention variants.
  • Extend the library with decode paths that support KV cache inputs for autoregressive inference workloads.
  • This branch also standardizes decode naming and parameter handling so the new kernels fit the existing Triton interface more cleanly.

Design

  • The feature adds dedicated Triton decode kernels for dense, sparse, and gated attention instead of overloading the existing forward path.
  • Decode-specific launch configuration and grid selection are handled separately to match single-token / KV-cache inference behavior.
  • Kernel and launch caching were consolidated through shared cache utilities so device-specific configuration can be reused efficiently across decode and existing Triton paths.
  • Alternatives considered:
    • Reusing the forward kernels directly for decode. This would have kept the surface smaller, but it would not model KV-cache decode behavior cleanly and would make launch/config specialization harder.
    • Implementing decode support only for one attention mode first. This was rejected in favor of keeping dense, sparse, and gated interfaces aligned.
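The consolidated cache-utility pattern described above can be sketched in plain Python. The names (`get_device_arch`, `get_launch_config`) and the tile sizes below are illustrative stand-ins for the sketch, not the repository's actual API; the real code would query CUDA device properties instead of a lookup table.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_device_arch(device: str) -> tuple:
    # In real code this would query torch.cuda.get_device_capability(device);
    # here a fake table keeps the sketch self-contained.
    fake_archs = {"cuda:0": (8, 0), "cuda:1": (9, 0)}
    return fake_archs[device]

@lru_cache(maxsize=None)
def get_launch_config(device: str, arch: tuple, is_decode: bool) -> dict:
    # Decode processes a single query token, so it favors small tiles along
    # the query dimension; newer architectures get wider KV tiles.
    block_m = 16 if is_decode else 128
    block_n = 64 if arch >= (8, 0) else 32
    return {"BLOCK_M": block_m, "BLOCK_N": block_n}
```

Because both helpers are memoized on hashable keys, repeated launches on the same device reuse the cached configuration object rather than recomputing it, which is the efficiency benefit the design bullet refers to.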

Changes

  • Add Triton decode kernels for:
    • dense attention
    • sparse attention
    • gated attention
  • Add public KV-cache decode APIs for:
    • flash_dense_attn_with_kvcache_func
    • flash_dense_attn_varlen_with_kvcache_func
    • flash_sparse_attn_with_kvcache_func
    • flash_sparse_attn_varlen_with_kvcache_func
    • flash_gated_attn_with_kvcache_func
    • flash_gated_attn_varlen_with_kvcache_func
  • Export the new KV-cache functions from the package top-level API.
  • Add support for optional preallocated output and LSE buffers in decode functions.
  • Rename forward-combine terminology to decode-combine for consistency with the new execution path.
  • Standardize decode-related parameter naming and simplify decode call signatures, including cleanup of unused parameters in varlen decode flows.
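The optional preallocated-output convention can be illustrated with a toy stand-in. `decode_step` and its doubling "kernel" are hypothetical and only mimic the calling convention: callers in a tight autoregressive loop pass `out` to avoid a per-step allocation, and the function validates or allocates as needed.

```python
def decode_step(query, out=None):
    # Allocate a fresh output only when the caller did not supply one.
    if out is None:
        out = [0.0] * len(query)
    elif len(out) != len(query):
        raise ValueError("preallocated `out` has the wrong shape")
    for i, q in enumerate(query):
        out[i] = q * 2.0  # trivial stand-in for the attention computation
    return out
```

The function returns the caller's buffer unchanged in identity, so loop code can keep reusing the same tensor across decode steps.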

Implementation Notes

  • New Triton decode kernels were introduced for dense, sparse, and gated attention, with separate base and variable-length decode paths.
  • KV-cache support is implemented as dedicated decode entry points rather than as a thin wrapper over the regular forward path.
  • Shared cache utilities were added/refined to cache Triton launchers, launch configs, grid factories, and device-architecture-aware kernel setup.
  • Input validation was extended with decode-specific checks, including validation for optional output tensors.
  • Several internal refactors were included to improve decode readability and consistency:
    • keyword-argument based decode calls
    • decode-specific launch/grid helpers
    • unified naming such as scale_log2 and decode-combine terminology
    • removal of unused decode parameters and simplified tensor-shape handling
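The `scale_log2` naming mentioned above is consistent with a common flash-attention trick: replacing `exp(score * scale)` with `exp2(score * scale_log2)`, where `scale_log2` folds `log2(e)` into the softmax scale, because `exp2` maps to a fast hardware instruction. Whether this exact formulation matches the repository's kernels is an assumption; the sketch below just shows the two forms are numerically equivalent.

```python
import math

LOG2_E = math.log2(math.e)

def softmax_weight_exp(score: float, scale: float) -> float:
    # Natural-exponent form used in textbook softmax.
    return math.exp(score * scale)

def softmax_weight_exp2(score: float, scale: float) -> float:
    # Base-2 form: fold log2(e) into the scale once, then use exp2.
    scale_log2 = scale * LOG2_E
    return 2.0 ** (score * scale_log2)
```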

Tests

  • Added and updated decode-focused tests for:
    • dense base decode
    • dense varlen decode
    • sparse base decode
    • sparse varlen decode
    • gated base decode
    • gated varlen decode
  • Tests cover both normal decode execution and paths using preallocated output buffers.
  • Benchmark coverage was also updated to exercise KV-cache decode variants.
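The "normal execution vs. preallocated buffers" coverage described above can be sketched as a comparison pattern: run the decode function once with fresh outputs and once into caller-owned buffers, then require identical results. `toy_decode` is a stand-in for the real kernel, not the repository's test code.

```python
def toy_decode(q, kv, out=None):
    if out is None:
        out = [0.0] * len(q)
    for i in range(len(q)):
        out[i] = q[i] + sum(kv) / len(kv)  # trivial stand-in for attention
    return out

def check_prealloc_matches_fresh(q, kv):
    fresh = toy_decode(q, kv)
    buf = [0.0] * len(q)
    pre = toy_decode(q, kv, out=buf)
    assert pre is buf     # kernel wrote into the caller's buffer
    assert fresh == pre   # identical numerics on both paths
    return True
```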

Docs

  • No user-facing documentation or example files were changed in this branch.
  • The change is currently covered by code-level API exposure and tests.

Checklist

LoserCheems and others added 30 commits April 22, 2026 16:33
…management and add get_dec_grid function for decode kernel
… architecture retrieval and add get_dec_dense_launch_config for decode dense kernel
… stride handling and output summation

Co-authored-by: Copilot <[email protected]>
LoserCheems and others added 25 commits April 26, 2026 00:30
…evice checks for cumulative sequence lengths and sequences

Co-authored-by: Copilot <[email protected]>
…rs and simplify tensor shapes

Co-authored-by: Copilot <[email protected]>
…ers and simplify tensor shapes

Co-authored-by: Copilot <[email protected]>
…rs and simplify tensor shapes

Co-authored-by: Copilot <[email protected]>
…ers and update tensor shape descriptions

Co-authored-by: Copilot <[email protected]>
Copilot AI review requested due to automatic review settings April 26, 2026 03:47
Contributor

Copilot AI left a comment


Pull request overview

This PR adds Triton-based decode (single-token) attention support with KV-cache inputs across dense/sparse/gated variants, and introduces shared caching utilities for Triton kernel compilation/launch configuration to improve reuse across forward/decode paths.

Changes:

  • Add new Triton decode kernels + public *_with_kvcache_func APIs for dense/sparse/gated (base + varlen KV).
  • Introduce shared Triton caching utilities (compiled kernel cache, launch-config/grid caching) and apply them across kernels.
  • Add decode correctness tests (including preallocated output/LSE buffers) and update decode benchmarks to use KV-cache APIs.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.

File Description
tests/test_utils.py Adds decode test helpers calling new KV-cache decode APIs and reference checks.
tests/test_dense_base_decode.py New dense base decode correctness test (optionally preallocated buffers).
tests/test_dense_varlen_decode.py New dense varlen decode correctness test (optionally preallocated buffers).
tests/test_sparse_base_decode.py New sparse base decode correctness test (optionally preallocated buffers).
tests/test_sparse_varlen_decode.py New sparse varlen decode correctness test (optionally preallocated buffers).
tests/test_gated_base_decode.py New gated base decode correctness test (optionally preallocated buffers).
tests/test_gated_varlen_decode.py New gated varlen decode correctness test (optionally preallocated buffers).
tests/benchmark_decode.py Switch decode benchmarks to KV-cache decode APIs and decode-shaped inputs.
flash_sparse_attn/ops/triton/utils.py Removes old get_arch; adds caching to num_splits_heuristic and adjusts heuristic.
flash_sparse_attn/ops/triton/cache_utils.py New shared caching helpers (device arch/SMs, launch-config/grid caching, compiled kernel caching wrapper).
flash_sparse_attn/ops/triton/launch_template.py Refactors launch-config selection to accept (device, arch) and caches results; adds decode + combine launch configs.
flash_sparse_attn/ops/triton/launch_grid.py Adds cached grid factories; adds decode grid + decode-combine grid; adds forward-combine grid.
flash_sparse_attn/ops/triton/assert_inputs.py Extends validation for decode + optional outputs; refactors fwd/bwd validation to accept (device, arch) and supports seqused_*.
flash_sparse_attn/ops/triton/interface.py Exposes new public KV-cache decode APIs for dense/sparse/gated (base + varlen).
flash_sparse_attn/ops/triton/flash_dense_fwd.py Adds caching wrappers and forward-combine usage; refactors to use (device, arch); adds is_split_kv parameter.
flash_sparse_attn/ops/triton/flash_dense_dec.py New dense decode kernels (base + varlen KV) and decode entry points.
flash_sparse_attn/ops/triton/flash_dense_bwd.py Adds caching wrappers; refactors to use (device, arch); cache-modifier tweaks.
flash_sparse_attn/ops/triton/flash_sparse_fwd.py Adds caching wrappers; refactors to use (device, arch); cache-modifier tweaks.
flash_sparse_attn/ops/triton/flash_sparse_dec.py New sparse decode kernels (base + varlen KV) and decode entry points.
flash_sparse_attn/ops/triton/flash_sparse_bwd.py Adds caching wrappers; refactors to use (device, arch); cache-modifier tweaks.
flash_sparse_attn/ops/triton/flash_gated_fwd.py Adds caching wrappers; refactors to use (device, arch); fixes launch-config selector for gated; cache-modifier tweaks.
flash_sparse_attn/ops/triton/flash_gated_dec.py New gated decode kernels (base + varlen KV) and decode entry points.
flash_sparse_attn/ops/triton/flash_gated_bwd.py Adds caching wrappers; refactors to use (device, arch); fixes launch-config selector for gated; cache-modifier tweaks.
flash_sparse_attn/ops/triton/flash_dec_combine.py Refactors decode-combine kernel shape/launch; adds caching wrappers and (device, arch) launch config; updates stride handling.
flash_sparse_attn/ops/triton/flash_fwd_combine.py New forward split-KV combine kernel and launcher using cached launch/grid utilities.
flash_sparse_attn/ops/triton/flash_bwd_preprocess.py Wraps preprocess Triton kernel with compiled-kernel caching.
flash_sparse_attn/ops/triton/flash_bwd_postprocess.py Wraps postprocess Triton kernel with compiled-kernel caching.
flash_sparse_attn/__init__.py Exports new top-level KV-cache decode APIs.
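The `is_split_kv` parameter noted for flash_dense_fwd.py corresponds to the decode-detection condition visible in the review excerpts further down: a call is treated as split-KV decode exactly when a single query token attends over a longer cached KV sequence. A self-contained sketch of that condition:

```python
def detect_split_kv(seqlen_q: int, seqlen_k: int) -> bool:
    # Single new token (seqlen_q == 1) attending over a longer KV cache:
    # parallelism must come from splitting the KV sequence across SMs.
    return seqlen_q == 1 and seqlen_q != seqlen_k
```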



:return num_sms: Number of streaming multiprocessors.
"""
return torch.cuda.get_device_properties(device).multi_processor_count
Comment on lines +618 to 626
is_split_kv: bool = False,
pack_gqa: bool = False,
) -> Tuple[torch.Tensor, torch.Tensor, float]:
num_SMs = torch.cuda.get_device_properties(query.device).multi_processor_count
device = query.device
arch = cache_utils.get_device_arch(device)
num_SMs = cache_utils.get_device_num_sms(device)
batch_size, seqlen_q, num_heads_q, head_dim = query.shape
_, seqlen_k, num_heads_kv, _ = key.shape
is_split_kv = seqlen_q == 1 and seqlen_q != seqlen_k
window_size_left, window_size_right = window_size
Comment on lines +774 to 787
is_split_kv: bool = False,
pack_gqa: bool = False,
seqused_q: Optional[torch.Tensor] = None,
seqused_k: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor, float]:
num_SMs = torch.cuda.get_device_properties(query.device).multi_processor_count
device = query.device
arch = cache_utils.get_device_arch(device)
num_SMs = cache_utils.get_device_num_sms(device)
total_seqlen_q, num_heads_q, head_dim = query.shape
_, num_heads_kv, _ = key.shape
batch_size = cu_seqlens_q.shape[0] - 1
seqlen_q = max_seqlen_q
seqlen_k = max_seqlen_k
is_split_kv = seqlen_q == 1 and seqlen_q != seqlen_k
window_size_left, window_size_right = window_size
Comment on lines +165 to +166
# inv_sum = tl.where((e_sum == 0.0) | (e_sum != e_sum), 0.0, 1.0 / e_sum)
inv_sum = 1.0 / e_sum
Comment on lines +166 to +167
# inv_sum = tl.where((e_sum == 0.0) | (e_sum != e_sum), 0.0, 1.0 / e_sum)
inv_sum = 1.0 / e_sum
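The excerpts above show a guarded reciprocal left commented out in favor of plain `1.0 / e_sum`. The guard matters when a softmax row is fully masked: the denominator `e_sum` can then be 0.0 (or NaN), and the guarded form returns 0.0 instead of propagating inf/NaN into the output. A pure-Python stand-in for the `tl.where` expression:

```python
def safe_inv_sum(e_sum: float) -> float:
    # e_sum != e_sum is the standard NaN check (NaN compares unequal to itself).
    if e_sum == 0.0 or e_sum != e_sum:
        return 0.0
    return 1.0 / e_sum
```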
LoserCheems merged commit 6b009de into main on Apr 26, 2026