
@alexzms alexzms commented Nov 28, 2025

This PR updates the Video Sparse Attention ThunderKittens implementation to support scenarios where the Query and Key/Value sequence lengths differ (different number of blocks).

Kernel Logic Validity:
- Forward Pass: No internal kernel modifications were necessary. The existing softmax implementation already applies K-masking, so the output remains logically correct even when the KV length differs from the Q length.
- Backward Pass: No internal kernel modifications were necessary. While the $dK$ computation accumulates contributions from $Q$ and $dO$ across the sequence, the padded rows of the upstream gradient are numerically zero, so the existing accumulation logic remains mathematically correct. A minimal sketch of both arguments follows below.
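
To make the two arguments above concrete, here is a minimal PyTorch sketch (not the kernel code; shapes, names, and the dense attention oracle are illustrative assumptions): padded K/V columns masked out before the softmax leave the output unchanged, and zero rows in the upstream gradient `dO` contribute nothing to the accumulated `dK`.

```python
# Illustrative only: a dense-attention stand-in for the sparse kernel's logic.
import torch

torch.manual_seed(0)
d, q_len, kv_len, kv_pad, q_pad = 16, 8, 6, 4, 3

Q = torch.randn(q_len, d, dtype=torch.float64)
K = torch.randn(kv_len, d, dtype=torch.float64)
V = torch.randn(kv_len, d, dtype=torch.float64)

def attn(q, k, v, k_mask=None):
    # Plain scaled-dot-product attention with an optional key mask.
    s = q @ k.T / d ** 0.5
    if k_mask is not None:
        s = s.masked_fill(~k_mask, float("-inf"))
    return torch.softmax(s, dim=-1) @ v

# Forward: padded K/V columns masked to -inf receive zero probability,
# so the output matches attention over the true (unpadded) KV length.
O_ref = attn(Q, K, V)
K_pad = torch.cat([K, torch.randn(kv_pad, d, dtype=torch.float64)])
V_pad = torch.cat([V, torch.randn(kv_pad, d, dtype=torch.float64)])
k_mask = torch.zeros(q_len, kv_len + kv_pad, dtype=torch.bool)
k_mask[:, :kv_len] = True
assert torch.allclose(O_ref, attn(Q, K_pad, V_pad, k_mask))

# Backward: rows of dO corresponding to padded Q positions are zero,
# so they contribute nothing to the accumulated dK.
Q_full = torch.cat([Q, torch.randn(q_pad, d, dtype=torch.float64)]).requires_grad_(True)
K_req = K.clone().requires_grad_(True)
dO = torch.zeros(q_len + q_pad, d, dtype=torch.float64)
dO[:q_len] = torch.randn(q_len, d, dtype=torch.float64)  # padded dO rows stay zero
attn(Q_full, K_req, V).backward(dO)
dK_with_padding = K_req.grad.clone()

K_req.grad = None
Q_real = Q_full.detach()[:q_len].clone().requires_grad_(True)
attn(Q_real, K_req, V).backward(dO[:q_len])
assert torch.allclose(dK_with_padding, K_req.grad)
```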

Interface Updates:
Updated the CUDA host code API to explicitly handle distinct q_seq_len and kv_seq_len arguments, replacing the previous single seq_len assumption.
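
As an illustration only (the actual host-side signature is not reproduced here), the sketch below shows how two distinct lengths map to separate block counts, reusing the q_num_blocks / kv_num_blocks names from the updated test helper; BLOCK_SIZE and the helper itself are hypothetical.

```python
# Hypothetical helper, for illustration only; the real block size and the way
# the host code derives block counts may differ.
import math

BLOCK_SIZE = 64  # illustrative tile size, not necessarily the kernel's

def seq_lens_to_block_counts(q_seq_len: int, kv_seq_len: int) -> tuple[int, int]:
    q_num_blocks = math.ceil(q_seq_len / BLOCK_SIZE)
    kv_num_blocks = math.ceil(kv_seq_len / BLOCK_SIZE)
    return q_num_blocks, kv_num_blocks

# e.g. a 4096-token query attending over a 2048-token KV sequence:
print(seq_lens_to_block_counts(4096, 2048))  # -> (64, 32)
```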

Testing:
Added csrc/attn/tests/test_vsa_forward.py for forward-only verification.
Updated csrc/attn/tests/test_vsa.py to include backward tests.
Introduced qkdiff test cases to specifically verify correctness when the Q block count differs from the KV block count.
All tests pass. Compilation and installation steps remain consistent with the existing documentation.
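
For illustration, here is a hedged sketch of the kind of relative-error check the tests apply (mirroring the rel_max_diff metric quoted in the review below, with the epsilon guard used in test_vsa_forward.py); the tensors and threshold are synthetic stand-ins, not the actual contents of test_vsa.py.

```python
# Synthetic stand-in tensors; the real tests compare the kernel output
# against a PyTorch reference and aggregate these statistics per tensor.
import torch

def rel_max_diff(kernel_out: torch.Tensor, reference: torch.Tensor) -> float:
    abs_diff = (kernel_out - reference).abs()
    # Epsilon keeps the ratio finite when the reference is (near-)zero,
    # as done in test_vsa_forward.py.
    return (abs_diff.max() / (reference.abs().mean() + 1e-6)).item()

ref = torch.randn(2, 8, 256, 64)             # e.g. a reference output or gradient
out = ref + 1e-4 * torch.randn_like(ref)     # stand-in for a kernel output
assert rel_max_diff(out, ref) < 1e-2
```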

Future TODO:
Extend the Triton kernel to support differing Q/KV sequence lengths as well.

@gemini-code-assist

Summary of Changes

Hello @alexzms, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Video Sparse Attention (VSA) system by introducing support for variable Query and Key/Value sequence lengths. This change removes a previous limitation where Q and KV were assumed to have identical sequence lengths, thereby increasing the adaptability and utility of the VSA implementation for diverse model architectures and data patterns. The modifications primarily involve updating the API and testing infrastructure, as the underlying kernel logic was already robust enough to handle these variations.

Highlights

  • Variable Q/KV Sequence Length Support: The Video Sparse Attention (VSA) implementation now supports scenarios where Query (Q) and Key/Value (KV) sequence lengths differ, allowing for more flexible attention mechanisms.
  • Kernel Logic Integrity: No internal kernel modifications were required for either the forward or backward passes. The existing K-masking in Softmax and the handling of numerically zero padded regions in gradient computation ensure logical and mathematical correctness.
  • API Updates: The CUDA host code API has been updated to explicitly accept distinct q_seq_len and kv_seq_len arguments, replacing the previous assumption of a single seq_len.
  • Comprehensive Testing: New forward-only tests (csrc/attn/tests/test_vsa_forward.py) and updates to existing backward tests (csrc/attn/tests/test_vsa.py) have been added, including specific qkdiff test cases to verify correctness with differing Q and KV block numbers. All tests are reported to pass.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully updates the Video Sparse Attention (VSA) implementation to support variable sequence lengths for Query and Key/Value tensors. The changes to the CUDA kernel, Python wrappers, and test suite are comprehensive and well-executed. A new forward-only test has been added and existing tests have been expanded to cover differing Q/KV length scenarios. My review includes a few suggestions to enhance test robustness and improve code clarity through more consistent variable naming across the Python and CUDA layers. I also noted a misleading comment in a test file that should be corrected.



def block_sparse_kernel_test(Q, K, V, block_sparse_mask, variable_block_sizes, non_pad_index, dO):
def block_sparse_kernel_test(Q, K, V, block_sparse_mask, variable_block_sizes, q_non_pad_index, kv_non_pad_index, q_num_blocks, kv_num_blocks, dO):

Severity: medium

For clarity, consider renaming variable_block_sizes to kv_variable_block_sizes in the function signature. This parameter is used for the K and V tensors, and the underlying CUDA kernel now expects kv_block_size. This change would make the Python test code more consistent with the C++/CUDA layer. The usage on line 45 should be updated accordingly.

Suggested change
def block_sparse_kernel_test(Q, K, V, block_sparse_mask, variable_block_sizes, q_non_pad_index, kv_non_pad_index, q_num_blocks, kv_num_blocks, dO):
def block_sparse_kernel_test(Q, K, V, block_sparse_mask, kv_variable_block_sizes, q_non_pad_index, kv_non_pad_index, q_num_blocks, kv_num_blocks, dO):

abs_diff = torch.abs(diff)
results[name]['sum_diff'] += torch.sum(abs_diff).item()
results[name]['sum_abs'] += torch.sum(torch.abs(pt)).item()
rel_max_diff = torch.max(abs_diff) / torch.mean(torch.abs(pt))

Severity: medium

The calculation of rel_max_diff could lead to a division by zero if torch.mean(torch.abs(pt)) is zero. This could cause NaN or inf values, making the test results unreliable. It's safer to add a small epsilon to the denominator to prevent this, similar to what's done in test_vsa_forward.py.

Suggested change
rel_max_diff = torch.max(abs_diff) / torch.mean(torch.abs(pt))
rel_max_diff = torch.max(abs_diff) / (torch.mean(torch.abs(pt)) + 1e-6)

K: torch.Tensor,
V: torch.Tensor,
block_sparse_mask: torch.Tensor,
variable_block_sizes: torch.Tensor,

Severity: medium

For clarity and consistency with the underlying CUDA/Triton kernels, consider renaming variable_block_sizes to kv_variable_block_sizes. This parameter specifically refers to the block sizes of the Key and Value tensors. The usage on line 71 should be updated accordingly.

Suggested change
variable_block_sizes: torch.Tensor,
kv_variable_block_sizes: torch.Tensor,

Comment on lines +156 to +158
- The Triton backend supports different Q/KV logical lengths via padding.
- The SM90 (H100) CUDA backend currently assumes the same number of blocks
for Q and KV, so we skip this test there.

Severity: medium

This comment appears to be incorrect. The PR description states that the Triton kernel does not yet support variable Q/KV lengths, while this PR adds support for it in the SM90 (H100) CUDA backend. The comment states the opposite. Please correct this to avoid confusion.

@alexzms alexzms changed the title from "[Feature] Support for Variable Q/KV Sequence Lengths in VSA" to "[Feature] Support for Variable Q/KV Sequence Lengths in VSA ThunderKittens kernel" on Nov 28, 2025
@SolitaryThinker SolitaryThinker added the go Trigger Buildkite CI label Dec 2, 2025