[Performance] Improve the inference performance of Eagle3. #4441
base: main
Conversation
vLLM version: v0.11.0
vLLM main: vllm-project/vllm

Signed-off-by: liumail202512 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors _prepare_inputs in EagleProposer to use PyTorch operations instead of NumPy, which should improve performance by avoiding CPU-NPU data transfers. The changes look good overall, but I've found a critical issue with a hardcoded block size and a type correctness issue that could lead to bugs. Please see my comments for details and suggestions.
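The performance idea is easiest to see in isolation. Below is a minimal, self-contained sketch of a NumPy-free `_prepare_inputs`-style helper; the function name and exact signature are assumptions for illustration, not the PR's actual code. The point is that the cumulative sums and gather indices are built from the device-resident tensors with torch ops, so no host/NPU copy is needed to construct them.

```python
import torch

def prepare_inputs_torch(cu_target_query_lens: torch.Tensor,
                         num_rejected_tokens: torch.Tensor):
    """Illustrative sketch: build Eagle input indices with torch ops only.

    cu_target_query_lens: [num_reqs + 1] cumulative target query lengths.
    num_rejected_tokens:  [num_reqs] rejected draft tokens per request.
    Returns (cu_num_tokens, token_indices) on the same device as the inputs.
    """
    query_lens = cu_target_query_lens[1:] - cu_target_query_lens[:-1]
    num_tokens_per_req = query_lens - num_rejected_tokens

    cu_num_tokens = torch.zeros_like(cu_target_query_lens)
    torch.cumsum(num_tokens_per_req, dim=0, out=cu_num_tokens[1:])

    # Map every kept token back to its owning request, then offset it into
    # the original (unpruned) token layout of the target model.
    num_reqs = num_tokens_per_req.numel()
    req_ids = torch.repeat_interleave(
        torch.arange(num_reqs, device=cu_num_tokens.device),
        num_tokens_per_req.to(torch.int64))
    positions = torch.arange(req_ids.numel(), device=cu_num_tokens.device)
    token_indices = (positions - cu_num_tokens[req_ids]
                     + cu_target_query_lens[req_ids])
    return cu_num_tokens, token_indices
```

The vectorized gather at the end plays the same role as the `_prepare_eagle_input_sequential` call quoted in the diff below; it is shown only to make the NumPy-to-PyTorch trade-off concrete.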
BLOCK_SIZE = 1024
self._prepare_eagle_input_sequential(
    token_indices,
    cu_target_query_lens,
    cu_num_tokens,
    block_size=BLOCK_SIZE,
)
The BLOCK_SIZE is hardcoded to 1024. The _prepare_eagle_input_sequential method uses this block_size to create an offsets tensor, assuming that the number of tokens per request will not exceed this value. However, the number of tokens per request can be up to max_num_batched_tokens (default 2560), which is larger than 1024. If a request has more than 1024 tokens, this will lead to incorrect indexing and corrupt output. This is a critical bug. To fix this, block_size should be determined dynamically based on the maximum number of tokens per request in the current batch.
Suggested change:
-BLOCK_SIZE = 1024
-self._prepare_eagle_input_sequential(
-    token_indices,
-    cu_target_query_lens,
-    cu_num_tokens,
-    block_size=BLOCK_SIZE,
-)
+if num_tokens > 0:
+    block_size = int(torch.max(num_tokens_per_req).item())
+    self._prepare_eagle_input_sequential(
+        token_indices,
+        cu_target_query_lens,
+        cu_num_tokens,
+        block_size=block_size,
+    )
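To illustrate why the cap matters, here is a rough sketch of the offsets-based indexing pattern the comment describes. It is not the actual `_prepare_eagle_input_sequential` implementation; the function name and shape handling are assumptions for illustration.

```python
import torch

def build_token_indices(cu_target_query_lens: torch.Tensor,
                        cu_num_tokens: torch.Tensor,
                        block_size: int) -> torch.Tensor:
    """Offsets-based index construction (illustrative only).

    Each request gets up to `block_size` candidate slots; slots beyond the
    request's actual token count are masked out. If any request has more
    tokens than `block_size`, those tokens are silently dropped, which is
    why a hardcoded 1024 can corrupt the output for long requests.
    """
    num_tokens_per_req = cu_num_tokens[1:] - cu_num_tokens[:-1]

    # [num_reqs, block_size] grid of candidate indices into the target tokens.
    offsets = torch.arange(block_size, device=cu_target_query_lens.device)
    candidates = cu_target_query_lens[:-1].unsqueeze(1) + offsets.unsqueeze(0)

    # Keep only the first num_tokens_per_req[i] slots of row i.
    mask = offsets.unsqueeze(0) < num_tokens_per_req.unsqueeze(1)
    return candidates[mask]
```

With a per-batch block_size of `int(torch.max(num_tokens_per_req).item())`, as suggested above, every row of the mask is wide enough for the current batch, so no tokens can be dropped.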
)
cu_num_tokens, token_indices = \
    self._prepare_inputs(eagle_attn_metadata, num_rejected_tokens)
num_tokens = num_scheduled_tokens - sum(num_rejected_tokens)
The result of sum(num_rejected_tokens) is a 0-dimensional tensor. When subtracted from num_scheduled_tokens (an int), the result num_tokens is also a 0-dimensional tensor. However, the _prepare_inputs function is type-hinted to accept an int for num_tokens. This type mismatch could lead to unexpected behavior. Please convert the tensor to a Python integer using .item() for type correctness and clarity.
Suggested change:
-num_tokens = num_scheduled_tokens - sum(num_rejected_tokens)
+num_tokens = num_scheduled_tokens - torch.sum(num_rejected_tokens).item()
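For reference, a standalone reproduction of the type difference (the values here are made up):

```python
import torch

num_scheduled_tokens = 2560                      # plain Python int
num_rejected_tokens = torch.tensor([1, 0, 2])    # per-request tensor

bad = num_scheduled_tokens - sum(num_rejected_tokens)                # 0-dim tensor
good = num_scheduled_tokens - torch.sum(num_rejected_tokens).item()  # Python int

print(type(bad))   # <class 'torch.Tensor'>
print(type(good))  # <class 'int'>
```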
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?