[Performance] Improve the inference performance of Eagle3. #4441
base: main
Conversation
vLLM version: v0.11.0
vLLM main: vllm-project/vllm

Signed-off-by: liumail202512 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors _prepare_inputs in EagleProposer to use PyTorch operations instead of NumPy, which should improve performance by avoiding CPU-NPU data transfers. The changes look good overall, but I've found a critical issue with a hardcoded block size and a type correctness issue that could lead to bugs. Please see my comments for details and suggestions.
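The performance idea is easiest to see in isolation. Below is a minimal, self-contained sketch of a NumPy-free `_prepare_inputs`-style helper; the function name and exact signature are assumptions for illustration, not the PR's actual code. The point is that the cumulative sums and gather indices are built from the device-resident tensors with torch ops, so no host/NPU copy is needed to construct them.

```python
import torch

def prepare_inputs_torch(cu_target_query_lens: torch.Tensor,
                         num_rejected_tokens: torch.Tensor):
    """Illustrative sketch: build Eagle input indices with torch ops only.

    cu_target_query_lens: [num_reqs + 1] cumulative target query lengths.
    num_rejected_tokens:  [num_reqs] rejected draft tokens per request.
    Returns (cu_num_tokens, token_indices) on the same device as the inputs.
    """
    query_lens = cu_target_query_lens[1:] - cu_target_query_lens[:-1]
    num_tokens_per_req = query_lens - num_rejected_tokens

    cu_num_tokens = torch.zeros_like(cu_target_query_lens)
    torch.cumsum(num_tokens_per_req, dim=0, out=cu_num_tokens[1:])

    # Map every kept token back to its owning request, then offset it into
    # the original (unpruned) token layout of the target model.
    num_reqs = num_tokens_per_req.numel()
    req_ids = torch.repeat_interleave(
        torch.arange(num_reqs, device=cu_num_tokens.device),
        num_tokens_per_req.to(torch.int64))
    positions = torch.arange(req_ids.numel(), device=cu_num_tokens.device)
    token_indices = (positions - cu_num_tokens[req_ids]
                     + cu_target_query_lens[req_ids])
    return cu_num_tokens, token_indices
```

The vectorized gather at the end plays the same role as the `_prepare_eagle_input_sequential` call quoted in the diff below; it is shown only to make the NumPy-to-PyTorch trade-off concrete.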
BLOCK_SIZE = 1024
self._prepare_eagle_input_sequential(
    token_indices,
    cu_target_query_lens,
    cu_num_tokens,
    block_size=BLOCK_SIZE,
)
The BLOCK_SIZE is hardcoded to 1024. The _prepare_eagle_input_sequential method uses this block_size to create an offsets tensor, assuming that the number of tokens per request will not exceed this value. However, the number of tokens per request can be up to max_num_batched_tokens (default 2560), which is larger than 1024. If a request has more than 1024 tokens, this will lead to incorrect indexing and corrupt output. This is a critical bug. To fix this, block_size should be determined dynamically based on the maximum number of tokens per request in the current batch.
Suggested change:
-BLOCK_SIZE = 1024
-self._prepare_eagle_input_sequential(
-    token_indices,
-    cu_target_query_lens,
-    cu_num_tokens,
-    block_size=BLOCK_SIZE,
-)
+if num_tokens > 0:
+    block_size = int(torch.max(num_tokens_per_req).item())
+    self._prepare_eagle_input_sequential(
+        token_indices,
+        cu_target_query_lens,
+        cu_num_tokens,
+        block_size=block_size,
+    )
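To illustrate why the cap matters, here is a rough sketch of the offsets-based indexing pattern the comment describes. It is not the actual `_prepare_eagle_input_sequential` implementation; the function name and shape handling are assumptions for illustration.

```python
import torch

def build_token_indices(cu_target_query_lens: torch.Tensor,
                        cu_num_tokens: torch.Tensor,
                        block_size: int) -> torch.Tensor:
    """Offsets-based index construction (illustrative only).

    Each request gets up to `block_size` candidate slots; slots beyond the
    request's actual token count are masked out. If any request has more
    tokens than `block_size`, those tokens are silently dropped, which is
    why a hardcoded 1024 can corrupt the output for long requests.
    """
    num_tokens_per_req = cu_num_tokens[1:] - cu_num_tokens[:-1]

    # [num_reqs, block_size] grid of candidate indices into the target tokens.
    offsets = torch.arange(block_size, device=cu_target_query_lens.device)
    candidates = cu_target_query_lens[:-1].unsqueeze(1) + offsets.unsqueeze(0)

    # Keep only the first num_tokens_per_req[i] slots of row i.
    mask = offsets.unsqueeze(0) < num_tokens_per_req.unsqueeze(1)
    return candidates[mask]
```

With a per-batch block_size of `int(torch.max(num_tokens_per_req).item())`, as suggested above, every row of the mask is wide enough for the current batch, so no tokens can be dropped.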
)
cu_num_tokens, token_indices = \
    self._prepare_inputs(eagle_attn_metadata, num_rejected_tokens)
num_tokens = num_scheduled_tokens - sum(num_rejected_tokens)
The result of sum(num_rejected_tokens) is a 0-dimensional tensor. When subtracted from num_scheduled_tokens (an int), the result num_tokens is also a 0-dimensional tensor. However, the _prepare_inputs function is type-hinted to accept an int for num_tokens. This type mismatch could lead to unexpected behavior. Please convert the tensor to a Python integer using .item() for type correctness and clarity.
Suggested change:
-num_tokens = num_scheduled_tokens - sum(num_rejected_tokens)
+num_tokens = num_scheduled_tokens - torch.sum(num_rejected_tokens).item()
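For reference, a standalone reproduction of the type difference (the values here are made up):

```python
import torch

num_scheduled_tokens = 2560                      # plain Python int
num_rejected_tokens = torch.tensor([1, 0, 2])    # per-request tensor

bad = num_scheduled_tokens - sum(num_rejected_tokens)                # 0-dim tensor
good = num_scheduled_tokens - torch.sum(num_rejected_tokens).item()  # Python int

print(type(bad))   # <class 'torch.Tensor'>
print(type(good))  # <class 'int'>
```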
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?