Conversation

@vadiklyutiy (Contributor) commented Oct 27, 2025

Purpose

The following

VLLM_USE_TRTLLM_ATTENTION=0 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching --async-scheduling
lm_eval --model local-completions --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://localhost:8000/v1/completions -t gsm8k --num_fewshot 5 --batch_size 250

Produced

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.0197|±  |0.0038|
|     |       |strict-match    |     5|exact_match|↑  |0.0000|±  |0.0000|

The issue was a mix-up between kv_manager_block_size and kernel_block_size: we passed kv_manager_block_size instead of kernel_block_size, which caused garbage loads in the attention kernel.
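
For readers unfamiliar with the two sizes, here is a minimal, self-contained sketch (not vLLM code; the variable names and sizes are illustrative) of why indexing the KV cache with the manager's block size instead of the kernel's page size reads the wrong slots:

```python
# Illustrative sketch only: the KV buffer is laid out with the kernel's page
# size, so any (page, offset) -> slot math must use that same page size.
KV_MANAGER_BLOCK_SIZE = 256  # allocation granularity of the hybrid KV cache manager
KERNEL_BLOCK_SIZE = 16       # page size the attention kernel's KV layout actually uses

NUM_KERNEL_PAGES = 128
kv_buffer = list(range(NUM_KERNEL_PAGES * KERNEL_BLOCK_SIZE))  # one entry per token slot

def token_slot(page_id: int, in_page_offset: int, page_size: int) -> int:
    """How an attention kernel maps (page, offset) to a flat buffer index."""
    return page_id * page_size + in_page_offset

# Correct: index with the page size the buffer was built with.
ok = token_slot(page_id=3, in_page_offset=5, page_size=KERNEL_BLOCK_SIZE)       # 53

# Bug reproduced: telling the kernel page_size == KV_MANAGER_BLOCK_SIZE makes it
# read slots belonging to other pages (or past the end of the buffer for larger
# page ids), i.e. the "garbage loads" described above.
bad = token_slot(page_id=3, in_page_offset=5, page_size=KV_MANAGER_BLOCK_SIZE)  # 773
print(ok, bad, len(kv_buffer))
```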

Test Result

After fix

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8514|±  |0.0098|
|     |       |strict-match    |     5|exact_match|↑  |0.8089|±  |0.0108|

@gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves a critical bug in the hybrid KV cache for the FlashInfer backend, where an incorrect block size was being passed to the attention kernel, leading to severely degraded accuracy. The fix correctly determines the kernel_block_size and uses it for the FlashInfer page_size, which is the right approach. The accompanying refactoring, which moves the logic for finding compatible block sizes into AttentionSpec, is a good design improvement that centralizes the logic and removes code duplication from GPUModelRunner. The changes are well-executed and appear robust.
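
To make the described fix concrete, here is a hedged sketch of the kind of selection logic the review refers to; the function name, signature, and supported sizes below are assumptions for illustration, not the actual AttentionSpec API:

```python
# Sketch (assumed names, not the real vLLM method): pick kernel page sizes that
# evenly divide the KV-manager block size, so the page_size handed to FlashInfer
# matches the layout the KV cache was actually allocated with.
def find_compatible_kernel_block_sizes(
    kv_manager_block_size: int,
    supported_kernel_block_sizes: list[int],
) -> list[int]:
    compatible = [
        size
        for size in supported_kernel_block_sizes
        if kv_manager_block_size % size == 0
    ]
    if not compatible:
        raise ValueError(
            f"no supported kernel block size divides {kv_manager_block_size}"
        )
    return compatible

# Example: a hybrid model whose manager block is 576 tokens and a backend that
# supports 16/32/64-token pages can use any of them as FlashInfer's page_size.
print(find_compatible_kernel_block_sizes(576, [16, 32, 64]))  # [16, 32, 64]
```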

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@vadiklyutiy (Contributor, Author)

Because the root cause came from #24486, CC @zhiyuan1i @heheda12345.

@pavanimajety added the bug label Oct 27, 2025
@heheda12345 (Collaborator)

FYI: see #26936 for the discussion related to this bug.

@vadiklyutiy (Contributor, Author)

> FYI: see #26936 for the discussion related to this bug.

There are at least 2 bugs here :)
It seems there are a lot of different bugs discussed in #26936, but at least some of them should be fixed by this PR.

@vadiklyutiy (Contributor, Author)

@pavanimajety @heheda12345
Could you please take a look at this PR?

@tdoublep (Member)

I have solved this in a slightly different way in this PR: #27753

I opted to change the way we initialize things so that we can pass the kernel block sizes to the constructors of the attention metadata builders.
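
For illustration only, here is a rough sketch of the constructor-injection idea described above; the class and parameter names are hypothetical and do not reflect the actual code in #27753:

```python
# Hypothetical sketch: the metadata builder receives the kernel block size once,
# at construction time, and always uses it (never the manager block size) as the
# page_size it reports to FlashInfer.
class FlashInferMetadataBuilderSketch:
    def __init__(self, kv_manager_block_size: int, kernel_block_size: int) -> None:
        self.kv_manager_block_size = kv_manager_block_size  # allocator granularity
        self.page_size = kernel_block_size                  # what FlashInfer is told

    def build(self, num_tokens: int) -> dict:
        # Any plan/wrapper arguments derived here use the kernel page size.
        return {"page_size": self.page_size, "num_tokens": num_tokens}

builder = FlashInferMetadataBuilderSketch(kv_manager_block_size=576, kernel_block_size=64)
assert builder.build(32)["page_size"] == 64
```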

```python
# When using hybrid blocks (self.kv_cache_spec.block_size != kernel_block_size),
# the KV cache is allocated with kernel_block_size, so we must use that
# for page_size when calling FlashInfer.
kernel_block_size = self.kv_cache_spec.find_compatible_kernel_block_sizes(
```
Collaborator

Can you refactor _select_common_block_size a bit so that you can call something like GPUModelRunner._select_common_block_size? I don't think it's a good idea to move to self.kv_cache_spec.find_compatible_kernel_block_sizes.

@vadiklyutiy (Contributor, Author)

To avoid misunderstanding, I'd like to double-check: do you mean find_compatible_kernel_block_sizes or _select_common_block_size?

Collaborator

It's up to you. Just call GPUModelRunner.xxxx() here and minimize the changes to the GPU model runner.

@mergify bot commented Oct 31, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Oct 31, 2025
@vadiklyutiy (Contributor, Author)

I'm not going to play this game

@vadiklyutiy closed this Nov 1, 2025
Labels: bug (Something isn't working), needs-rebase, v1