
@alexzms alexzms commented Nov 28, 2025

This PR updates the Video Sparse Attention ThunderKittens implementation to support scenarios where the Query and Key/Value sequence lengths differ (different number of blocks).

Kernel Logic Validity:
- Forward Pass: No internal kernel modifications were necessary. The existing softmax implementation already applies K-masking, so the output remains logically correct even when the KV length differs from the Q length.
- Backward Pass: No internal kernel modifications were necessary. While the $dK$ computation accumulates contributions from $Q$ and $dO$ across the sequence, the padded rows of the upstream gradient are numerically zero, so the existing accumulation logic remains mathematically correct. A minimal sketch of both arguments follows below.
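
To make the two arguments above concrete, here is a minimal PyTorch sketch (not the kernel code; shapes, names, and the dense attention oracle are illustrative assumptions): padded K/V columns masked out before the softmax leave the output unchanged, and zero rows in the upstream gradient `dO` contribute nothing to the accumulated `dK`.

```python
# Illustrative only: a dense-attention stand-in for the sparse kernel's logic.
import torch

torch.manual_seed(0)
d, q_len, kv_len, kv_pad, q_pad = 16, 8, 6, 4, 3

Q = torch.randn(q_len, d, dtype=torch.float64)
K = torch.randn(kv_len, d, dtype=torch.float64)
V = torch.randn(kv_len, d, dtype=torch.float64)

def attn(q, k, v, k_mask=None):
    # Plain scaled-dot-product attention with an optional key mask.
    s = q @ k.T / d ** 0.5
    if k_mask is not None:
        s = s.masked_fill(~k_mask, float("-inf"))
    return torch.softmax(s, dim=-1) @ v

# Forward: padded K/V columns masked to -inf receive zero probability,
# so the output matches attention over the true (unpadded) KV length.
O_ref = attn(Q, K, V)
K_pad = torch.cat([K, torch.randn(kv_pad, d, dtype=torch.float64)])
V_pad = torch.cat([V, torch.randn(kv_pad, d, dtype=torch.float64)])
k_mask = torch.zeros(q_len, kv_len + kv_pad, dtype=torch.bool)
k_mask[:, :kv_len] = True
assert torch.allclose(O_ref, attn(Q, K_pad, V_pad, k_mask))

# Backward: rows of dO corresponding to padded Q positions are zero,
# so they contribute nothing to the accumulated dK.
Q_full = torch.cat([Q, torch.randn(q_pad, d, dtype=torch.float64)]).requires_grad_(True)
K_req = K.clone().requires_grad_(True)
dO = torch.zeros(q_len + q_pad, d, dtype=torch.float64)
dO[:q_len] = torch.randn(q_len, d, dtype=torch.float64)  # padded dO rows stay zero
attn(Q_full, K_req, V).backward(dO)
dK_with_padding = K_req.grad.clone()

K_req.grad = None
Q_real = Q_full.detach()[:q_len].clone().requires_grad_(True)
attn(Q_real, K_req, V).backward(dO[:q_len])
assert torch.allclose(dK_with_padding, K_req.grad)
```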

Interface Updates:
Updated the CUDA host code API to explicitly handle distinct q_seq_len and kv_seq_len arguments, replacing the previous single seq_len assumption.
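
As an illustration only (the actual host-side signature is not reproduced here), the sketch below shows how two distinct lengths map to separate block counts, reusing the q_num_blocks / kv_num_blocks names from the updated test helper; BLOCK_SIZE and the helper itself are hypothetical.

```python
# Hypothetical helper, for illustration only; the real block size and the way
# the host code derives block counts may differ.
import math

BLOCK_SIZE = 64  # illustrative tile size, not necessarily the kernel's

def seq_lens_to_block_counts(q_seq_len: int, kv_seq_len: int) -> tuple[int, int]:
    q_num_blocks = math.ceil(q_seq_len / BLOCK_SIZE)
    kv_num_blocks = math.ceil(kv_seq_len / BLOCK_SIZE)
    return q_num_blocks, kv_num_blocks

# e.g. a 4096-token query attending over a 2048-token KV sequence:
print(seq_lens_to_block_counts(4096, 2048))  # -> (64, 32)
```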

Testing:
Added csrc/attn/tests/test_vsa_forward.py for forward-only verification.
Updated csrc/attn/tests/test_vsa.py to include backward tests.
Introduced qkdiff test cases to specifically verify correctness when the Q block count differs from the KV block count.
All tests pass. Compilation and installation steps remain consistent with the existing documentation.
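
For illustration, here is a hedged sketch of the kind of relative-error check the tests apply (mirroring the rel_max_diff metric quoted in the review below, with the epsilon guard used in test_vsa_forward.py); the tensors and threshold are synthetic stand-ins, not the actual contents of test_vsa.py.

```python
# Synthetic stand-in tensors; the real tests compare the kernel output
# against a PyTorch reference and aggregate these statistics per tensor.
import torch

def rel_max_diff(kernel_out: torch.Tensor, reference: torch.Tensor) -> float:
    abs_diff = (kernel_out - reference).abs()
    # Epsilon keeps the ratio finite when the reference is (near-)zero,
    # as done in test_vsa_forward.py.
    return (abs_diff.max() / (reference.abs().mean() + 1e-6)).item()

ref = torch.randn(2, 8, 256, 64)             # e.g. a reference output or gradient
out = ref + 1e-4 * torch.randn_like(ref)     # stand-in for a kernel output
assert rel_max_diff(out, ref) < 1e-2
```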

Future TODO:
Extend the Triton kernel to support differing Q/KV sequence lengths as well.

@gemini-code-assist

Summary of Changes

Hello @alexzms, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Video Sparse Attention (VSA) system by introducing support for variable Query and Key/Value sequence lengths. This change removes a previous limitation where Q and KV were assumed to have identical sequence lengths, thereby increasing the adaptability and utility of the VSA implementation for diverse model architectures and data patterns. The modifications primarily involve updating the API and testing infrastructure, as the underlying kernel logic was already robust enough to handle these variations.

Highlights

  • Variable Q/KV Sequence Length Support: The Video Sparse Attention (VSA) implementation now supports scenarios where Query (Q) and Key/Value (KV) sequence lengths differ, allowing for more flexible attention mechanisms.
  • Kernel Logic Integrity: No internal kernel modifications were required for either the forward or backward passes. The existing K-masking in Softmax and the handling of numerically zero padded regions in gradient computation ensure logical and mathematical correctness.
  • API Updates: The CUDA host code API has been updated to explicitly accept distinct q_seq_len and kv_seq_len arguments, replacing the previous assumption of a single seq_len.
  • Comprehensive Testing: New forward-only tests (csrc/attn/tests/test_vsa_forward.py) and updates to existing backward tests (csrc/attn/tests/test_vsa.py) have been added, including specific qkdiff test cases to verify correctness with differing Q and KV block numbers. All tests are reported to pass.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully updates the Video Sparse Attention (VSA) implementation to support variable sequence lengths for Query and Key/Value tensors. The changes to the CUDA kernel, Python wrappers, and test suite are comprehensive and well-executed. A new forward-only test has been added and existing tests have been expanded to cover differing Q/KV length scenarios. My review includes a few suggestions to enhance test robustness and improve code clarity through more consistent variable naming across the Python and CUDA layers. I also noted a misleading comment in a test file that should be corrected.



def block_sparse_kernel_test(Q, K, V, block_sparse_mask, variable_block_sizes, non_pad_index, dO):
def block_sparse_kernel_test(Q, K, V, block_sparse_mask, variable_block_sizes, q_non_pad_index, kv_non_pad_index, q_num_blocks, kv_num_blocks, dO):

Severity: medium

For clarity, consider renaming variable_block_sizes to kv_variable_block_sizes in the function signature. This parameter is used for the K and V tensors, and the underlying CUDA kernel now expects kv_block_size. This change would make the Python test code more consistent with the C++/CUDA layer. The usage on line 45 should be updated accordingly.

Suggested change
def block_sparse_kernel_test(Q, K, V, block_sparse_mask, variable_block_sizes, q_non_pad_index, kv_non_pad_index, q_num_blocks, kv_num_blocks, dO):
def block_sparse_kernel_test(Q, K, V, block_sparse_mask, kv_variable_block_sizes, q_non_pad_index, kv_non_pad_index, q_num_blocks, kv_num_blocks, dO):

abs_diff = torch.abs(diff)
results[name]['sum_diff'] += torch.sum(abs_diff).item()
results[name]['sum_abs'] += torch.sum(torch.abs(pt)).item()
rel_max_diff = torch.max(abs_diff) / torch.mean(torch.abs(pt))

Severity: medium

The calculation of rel_max_diff could lead to a division by zero if torch.mean(torch.abs(pt)) is zero. This could cause NaN or inf values, making the test results unreliable. It's safer to add a small epsilon to the denominator to prevent this, similar to what's done in test_vsa_forward.py.

Suggested change
rel_max_diff = torch.max(abs_diff) / torch.mean(torch.abs(pt))
rel_max_diff = torch.max(abs_diff) / (torch.mean(torch.abs(pt)) + 1e-6)

K: torch.Tensor,
V: torch.Tensor,
block_sparse_mask: torch.Tensor,
variable_block_sizes: torch.Tensor,

Severity: medium

For clarity and consistency with the underlying CUDA/Triton kernels, consider renaming variable_block_sizes to kv_variable_block_sizes. This parameter specifically refers to the block sizes of the Key and Value tensors. The usage on line 71 should be updated accordingly.

Suggested change
variable_block_sizes: torch.Tensor,
kv_variable_block_sizes: torch.Tensor,

Comment on lines +156 to +158
- The Triton backend supports different Q/KV logical lengths via padding.
- The SM90 (H100) CUDA backend currently assumes the same number of blocks
for Q and KV, so we skip this test there.

Severity: medium

This comment appears to be incorrect. The PR description states that the Triton kernel does not yet support variable Q/KV lengths, while this PR adds support for it in the SM90 (H100) CUDA backend. The comment states the opposite. Please correct this to avoid confusion.

@alexzms alexzms changed the title from "[Feature] Support for Variable Q/KV Sequence Lengths in VSA" to "[Feature] Support for Variable Q/KV Sequence Lengths in VSA ThunderKittens kernel" on Nov 28, 2025
@SolitaryThinker SolitaryThinker added the go Trigger Buildkite CI label Dec 2, 2025