Conversation

@liumain1122 commented Nov 25, 2025

vLLM version: v0.11.0
vLLM main: vllm-project/vllm

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: liumail202512 <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (bot) left a comment

Code Review

This pull request refactors _prepare_inputs in EagleProposer to use PyTorch operations instead of NumPy, which should improve performance by avoiding CPU-NPU data transfers. The changes look good overall, but I've found a critical issue with a hardcoded block size and a type correctness issue that could lead to bugs. Please see my comments for details and suggestions.
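
For readers less familiar with the pattern, here is a minimal, self-contained sketch of the idea (not the PR's actual code; the function names and shapes are invented for illustration): index bookkeeping that used to round-trip through NumPy on the host can stay on the device with torch ops, avoiding a sync and a CPU-NPU copy per call.

import torch

def build_cu_lens_numpy(query_lens: torch.Tensor) -> torch.Tensor:
    # Pre-refactor style: copy to host, compute with NumPy, copy back.
    # Every .cpu()/.numpy() call forces a device sync plus a host transfer.
    lens_np = query_lens.cpu().numpy()
    return torch.from_numpy(lens_np.cumsum()).to(query_lens.device)

def build_cu_lens_torch(query_lens: torch.Tensor) -> torch.Tensor:
    # Post-refactor style: the cumulative sum stays on the device.
    return torch.cumsum(query_lens, dim=0)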

Comment on lines +620 to +626
        BLOCK_SIZE = 1024
        self._prepare_eagle_input_sequential(
            token_indices,
            cu_target_query_lens,
            cu_num_tokens,
            block_size=BLOCK_SIZE,
        )

critical

The BLOCK_SIZE is hardcoded to 1024. The _prepare_eagle_input_sequential method uses this block_size to create an offsets tensor, assuming that the number of tokens per request will not exceed this value. However, the number of tokens per request can be up to max_num_batched_tokens (default 2560), which is larger than 1024. If a request has more than 1024 tokens, this will lead to incorrect indexing and corrupt output. This is a critical bug. To fix this, block_size should be determined dynamically based on the maximum number of tokens per request in the current batch.

Suggested change
-        BLOCK_SIZE = 1024
-        self._prepare_eagle_input_sequential(
-            token_indices,
-            cu_target_query_lens,
-            cu_num_tokens,
-            block_size=BLOCK_SIZE,
-        )
+        if num_tokens > 0:
+            block_size = int(torch.max(num_tokens_per_req).item())
+            self._prepare_eagle_input_sequential(
+                token_indices,
+                cu_target_query_lens,
+                cu_num_tokens,
+                block_size=block_size,
+            )
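
To make the failure mode concrete, here is a rough illustration (not the actual _prepare_eagle_input_sequential implementation; the helper below and its shapes are invented) of how an offsets tensor capped at block_size silently drops tokens from any request longer than block_size:

import torch

def gather_token_indices(cu_query_lens: torch.Tensor,
                         num_tokens_per_req: torch.Tensor,
                         block_size: int) -> torch.Tensor:
    # One row of candidate indices per request, each row only block_size wide.
    offsets = torch.arange(block_size, device=cu_query_lens.device)
    starts = cu_query_lens[:-1].unsqueeze(1)           # [num_reqs, 1]
    candidate = starts + offsets.unsqueeze(0)          # [num_reqs, block_size]
    # Tokens beyond position block_size - 1 in a request are never covered
    # by the mask, so they are dropped instead of being indexed.
    mask = offsets.unsqueeze(0) < num_tokens_per_req.unsqueeze(1)
    return candidate[mask]

lens = torch.tensor([1500, 800])  # first request exceeds 1024 tokens
cu = torch.cat([torch.zeros(1, dtype=lens.dtype), lens.cumsum(0)])
print(gather_token_indices(cu, lens, block_size=1024).numel())             # 1824: tokens lost
print(gather_token_indices(cu, lens, block_size=int(lens.max())).numel())  # 2300: all tokens

Sizing block_size from the largest per-request token count, as suggested above, keeps the mask wide enough to cover every token in the batch.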

        )
        cu_num_tokens, token_indices = \
            self._prepare_inputs(eagle_attn_metadata, num_rejected_tokens)
        num_tokens = num_scheduled_tokens - sum(num_rejected_tokens)

high

The result of sum(num_rejected_tokens) is a 0-dimensional tensor. When subtracted from num_scheduled_tokens (an int), the result num_tokens is also a 0-dimensional tensor. However, the _prepare_inputs function is type-hinted to accept an int for num_tokens. This type mismatch could lead to unexpected behavior. Please convert the tensor to a Python integer using .item() for type correctness and clarity.

Suggested change
-        num_tokens = num_scheduled_tokens - sum(num_rejected_tokens)
+        num_tokens = num_scheduled_tokens - torch.sum(num_rejected_tokens).item()
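
A tiny standalone snippet (values made up) showing the type difference the comment is pointing at:

import torch

num_scheduled_tokens = 32
num_rejected_tokens = torch.tensor([1, 0, 2])

# Built-in sum over a tensor yields a 0-dim tensor, and int - tensor is
# still a tensor, so num_tokens silently becomes a torch.Tensor here.
num_tokens = num_scheduled_tokens - sum(num_rejected_tokens)
print(type(num_tokens))  # <class 'torch.Tensor'>

# .item() converts the 0-dim result to a Python int, matching the type hint.
num_tokens = num_scheduled_tokens - torch.sum(num_rejected_tokens).item()
print(type(num_tokens))  # <class 'int'>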

