Conversation

@vadiklyutiy (Contributor) commented Oct 27, 2025

Purpose

The following

VLLM_USE_TRTLLM_ATTENTION=0 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching --async-scheduling
lm_eval --model local-completions --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://localhost:8000/v1/completions -t gsm8k --num_fewshot 5 --batch_size 250

Produced

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.0197|±  |0.0038|
|     |       |strict-match    |     5|exact_match|↑  |0.0000|±  |0.0000|

The issue was a mix-up between kv_manager_block_size and kernel_block_size: we passed kv_manager_block_size instead of kernel_block_size, which caused garbage loads in the attention kernel.
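
For readers unfamiliar with the two sizes, here is a minimal, self-contained sketch (not vLLM code; the variable names and sizes are illustrative) of why indexing the KV cache with the manager's block size instead of the kernel's page size reads the wrong slots:

```python
# Illustrative sketch only: the KV buffer is laid out with the kernel's page
# size, so any (page, offset) -> slot math must use that same page size.
KV_MANAGER_BLOCK_SIZE = 256  # allocation granularity of the hybrid KV cache manager
KERNEL_BLOCK_SIZE = 16       # page size the attention kernel's KV layout actually uses

NUM_KERNEL_PAGES = 128
kv_buffer = list(range(NUM_KERNEL_PAGES * KERNEL_BLOCK_SIZE))  # one entry per token slot

def token_slot(page_id: int, in_page_offset: int, page_size: int) -> int:
    """How an attention kernel maps (page, offset) to a flat buffer index."""
    return page_id * page_size + in_page_offset

# Correct: index with the page size the buffer was built with.
ok = token_slot(page_id=3, in_page_offset=5, page_size=KERNEL_BLOCK_SIZE)       # 53

# Bug reproduced: telling the kernel page_size == KV_MANAGER_BLOCK_SIZE makes it
# read slots belonging to other pages (or past the end of the buffer for larger
# page ids), i.e. the "garbage loads" described above.
bad = token_slot(page_id=3, in_page_offset=5, page_size=KV_MANAGER_BLOCK_SIZE)  # 773
print(ok, bad, len(kv_buffer))
```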

Test Result

After fix

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8514|±  |0.0098|
|     |       |strict-match    |     5|exact_match|↑  |0.8089|±  |0.0108|

@gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves a critical bug in the hybrid KV cache for the FlashInfer backend, where an incorrect block size was being passed to the attention kernel, leading to severely degraded accuracy. The fix correctly determines the kernel_block_size and uses it for the FlashInfer page_size, which is the right approach. The accompanying refactoring, which moves the logic for finding compatible block sizes into AttentionSpec, is a good design improvement that centralizes the logic and removes code duplication from GPUModelRunner. The changes are well-executed and appear robust.
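
To make the described fix concrete, here is a hedged sketch of the kind of selection logic the review refers to; the function name, signature, and supported sizes below are assumptions for illustration, not the actual AttentionSpec API:

```python
# Sketch (assumed names, not the real vLLM method): pick kernel page sizes that
# evenly divide the KV-manager block size, so the page_size handed to FlashInfer
# matches the layout the KV cache was actually allocated with.
def find_compatible_kernel_block_sizes(
    kv_manager_block_size: int,
    supported_kernel_block_sizes: list[int],
) -> list[int]:
    compatible = [
        size
        for size in supported_kernel_block_sizes
        if kv_manager_block_size % size == 0
    ]
    if not compatible:
        raise ValueError(
            f"no supported kernel block size divides {kv_manager_block_size}"
        )
    return compatible

# Example: a hybrid model whose manager block is 576 tokens and a backend that
# supports 16/32/64-token pages can use any of them as FlashInfer's page_size.
print(find_compatible_kernel_block_sizes(576, [16, 32, 64]))  # [16, 32, 64]
```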

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@vadiklyutiy (Contributor, Author)

Because the root cause came from #24486, CC @zhiyuan1i @heheda12345.

@pavanimajety added the bug label Oct 27, 2025
@heheda12345 (Collaborator)

FYI: see #26936 for the discussion related to this bug.

@vadiklyutiy (Contributor, Author)

> FYI: see #26936 for the discussion related to this bug.

There are at least 2 bugs here :)
It seems there are a lot of different bugs discussed in #26936, but at least some of them should be fixed by this PR.

@vadiklyutiy (Contributor, Author)

@pavanimajety @heheda12345
Could you please take a look at this PR?

@tdoublep (Member)

I have solved this in a slightly different way in this PR: #27753

I opted to change the way we initialize things so that we can pass the kernel block sizes to the constructors of the attention metadata builders.
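
For illustration only, here is a rough sketch of the constructor-injection idea described above; the class and parameter names are hypothetical and do not reflect the actual code in #27753:

```python
# Hypothetical sketch: the metadata builder receives the kernel block size once,
# at construction time, and always uses it (never the manager block size) as the
# page_size it reports to FlashInfer.
class FlashInferMetadataBuilderSketch:
    def __init__(self, kv_manager_block_size: int, kernel_block_size: int) -> None:
        self.kv_manager_block_size = kv_manager_block_size  # allocator granularity
        self.page_size = kernel_block_size                  # what FlashInfer is told

    def build(self, num_tokens: int) -> dict:
        # Any plan/wrapper arguments derived here use the kernel page size.
        return {"page_size": self.page_size, "num_tokens": num_tokens}

builder = FlashInferMetadataBuilderSketch(kv_manager_block_size=576, kernel_block_size=64)
assert builder.build(32)["page_size"] == 64
```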

```python
# When using hybrid blocks (self.kv_cache_spec.block_size != kernel_block_size),
# the KV cache is allocated with kernel_block_size, so we must use that
# for page_size when calling FlashInfer.
kernel_block_size = self.kv_cache_spec.find_compatible_kernel_block_sizes(
```
Collaborator

Can you refactor _select_common_block_size a bit so that you can call something like GPUModelRunner._select_common_block_size? I don't think it's a good idea to move to self.kv_cache_spec.find_compatible_kernel_block_sizes.

@vadiklyutiy (Contributor, Author)

To avoid misunderstanding, I'd like to double-check: do you mean find_compatible_kernel_block_sizes or _select_common_block_size?

Collaborator

It's up to you. Just call GPUModelRunner.xxxx() here and minimize the changes to the GPU model runner.

@mergify bot commented Oct 31, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Oct 31, 2025
@vadiklyutiy (Contributor, Author)

I'm not going to play this game

@vadiklyutiy closed this Nov 1, 2025
Labels: bug (Something isn't working), needs-rebase, v1