[BUG] Fix hybrid kvcache kernel page size issue #27547
Conversation
Code Review
This pull request effectively resolves a critical bug in the hybrid KV cache for the FlashInfer backend, where an incorrect block size was being passed to the attention kernel, leading to significant performance degradation. The fix correctly determines the kernel_block_size and uses it for the FlashInfer page_size, which is the right approach. The accompanying refactoring, which moves the logic for finding compatible block sizes into AttentionSpec, is a good design improvement that centralizes the logic and removes code duplication from GPUModelRunner. The changes are well-executed and appear robust.
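To make the summary above concrete, here is a minimal sketch of the fix's shape. It is not the literal vLLM code: only `page_size` and the two block-size concepts come from this PR, everything else is illustrative.

```python
# The KV cache manager allocates logical blocks of kv_manager_block_size
# tokens, but the physical cache is laid out in kernel_block_size pages,
# so FlashInfer must be planned with the kernel page size.

kv_manager_block_size = 32   # block size used by the KV cache manager
kernel_block_size = 16       # page size the attention kernel actually uses

# Before the fix (buggy): the manager block size leaked into the kernel call:
#     wrapper.plan(..., page_size=kv_manager_block_size)   # -> garbage loads
# After the fix: the compatible kernel page size is used instead:
#     wrapper.plan(..., page_size=kernel_block_size)

# Hybrid case: one logical block must map to a whole number of kernel pages.
assert kv_manager_block_size % kernel_block_size == 0
```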
💡 Codex Review
Here are some automated review suggestions for this pull request.
Because the root cause came from #24486.
FYI: see #26936 for the discussion related to this bug.
@pavanimajety @heheda12345
I have solved this in a slightly different way in #27753: I opted to change the way we initialize things so we can pass the kernel block sizes to the constructor of the attention metadata builders.
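For contrast, a rough sketch of the two initialization shapes being compared. Neither class body is the actual vLLM code; only the PR numbers, `find_compatible_kernel_block_sizes` (from the diff below), and the idea of constructor injection come from the thread, and the `max()` policy is an assumption.

```python
# (a) This PR: the metadata builder derives the kernel block size from the
#     KV cache spec at init time.
class BuilderThisPR:
    def __init__(self, kv_cache_spec):
        # Exact signature and selection policy are assumptions here.
        self.kernel_block_size = max(
            kv_cache_spec.find_compatible_kernel_block_sizes())

# (b) PR #27753: the caller computes the kernel block size and passes it
#     to the builder's constructor.
class BuilderOtherPR:
    def __init__(self, kv_cache_spec, kernel_block_size: int):
        self.kernel_block_size = kernel_block_size
```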
```python
# When using hybrid blocks (self.kv_cache_spec.block_size != kernel_block_size),
# the KV cache is allocated with kernel_block_size, so we must use that
# for page_size when calling FlashInfer.
kernel_block_size = self.kv_cache_spec.find_compatible_kernel_block_sizes(
```
Can you refactor `_select_common_block_size` a bit so that you can call something like `GPUModelRunner._select_common_block_size`? I don't think it's a good idea to move it to `self.kv_cache_spec.find_compatible_kernel_block_sizes`.
To avoid misunderstanding, I'd like to double-check: do you mean `find_compatible_kernel_block_sizes` or `_select_common_block_size`?
Up to you. Just call `GPUModelRunner.xxxx()` here and minimize the changes to the GPU model runner.
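Whatever its final home, the selection logic under discussion boils down to something like the sketch below. The divisibility criterion and largest-first ordering are assumptions, not the verbatim implementation.

```python
def find_compatible_kernel_block_sizes(
    manager_block_size: int, kernel_supported: list[int]
) -> list[int]:
    # A kernel block size is compatible when it evenly divides the manager
    # block size, so each logical block maps to a whole number of kernel pages.
    return sorted(
        (k for k in kernel_supported if manager_block_size % k == 0),
        reverse=True,
    )

# A manager block of 48 tokens with kernel-supported page sizes {16, 32, 64}
# can only be served with 16-token pages:
print(find_compatible_kernel_block_sizes(48, [16, 32, 64]))  # [16]
```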
This pull request has merge conflicts that must be resolved before it can be merged.
I'm not going to play this game.
Purpose

The following

Produced

The issue was an intermix of `kv_manager_block_size` and `kernel_block_size`: we passed `kv_manager_block_size` instead of `kernel_block_size`, which caused garbage loads in the attention kernel.

Test Result

After fix
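As a hedged illustration of the failure mode described in the Purpose section (a deliberately simplified addressing model; the real cache layout also has head and hidden dimensions, and the page table itself is built by the allocator):

```python
kernel_block_size = 16        # physical page size of the KV cache
kv_manager_block_size = 32    # logical block size -- the value passed by mistake

def token_offset(page_table: list[int], token_idx: int, page_size: int) -> int:
    # How a paged-attention kernel locates a token: look up the page id,
    # then compute page_id * page_size + the in-page offset.
    page_id = page_table[token_idx // page_size]
    return page_id * page_size + token_idx % page_size

page_table = [7, 3, 9, 1]     # physical page ids, one per kernel-sized page

tok = 20
correct = token_offset(page_table, tok, kernel_block_size)    # 3*16 + 4 = 52
wrong = token_offset(page_table, tok, kv_manager_block_size)  # 7*32 + 20 = 244
print(correct, wrong)  # the mismatch sends the kernel to unrelated memory
```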