Async Scheduling X Spec Decoding Compatibility #4464
base: main
Conversation
… GC optimization Signed-off-by: jesse <[email protected]>
…kens for GC optimization" This reverts commit f1cadf1. Signed-off-by: jesse <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces support for speculative decoding with asynchronous scheduling on Ascend NPUs. The changes are primarily within vllm_ascend/worker/model_runner_v1.py, modifying the model execution and token sampling logic to handle the new asynchronous speculative decoding path. While the overall approach seems reasonable, my review has identified several critical issues that need to be addressed. These include a logic error that nullifies the speculative decoding output processing, duplicated code blocks, and incorrect usage of CUDA-specific stream APIs instead of the appropriate NPU APIs for the Ascend platform. There is also a syntax error that appears to be a merge artifact. These issues will prevent the code from functioning correctly and must be fixed.
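One issue the review flags is calling CUDA-specific stream APIs on the Ascend platform. A minimal, hypothetical sketch of the fix is to dispatch on the device type rather than hard-coding `torch.cuda`; the function and device names below are illustrative assumptions, not code from this PR:

```python
# Hypothetical sketch: resolve the module that owns the platform's
# Stream/Event APIs instead of assuming torch.cuda everywhere.
# On Ascend, torch_npu exposes torch.npu.Stream / torch.npu.current_stream.
def stream_module_name(device_type: str) -> str:
    """Return the dotted module path owning stream APIs for a device type."""
    modules = {
        "cuda": "torch.cuda",  # NVIDIA GPUs
        "npu": "torch.npu",    # Ascend NPUs (provided by torch_npu)
    }
    if device_type not in modules:
        raise ValueError(f"unsupported device type: {device_type}")
    return modules[device_type]
```

In the actual model runner this dispatch would happen once at initialization, so the async-scheduling path uses NPU streams on Ascend without branching at every call site.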
Signed-off-by: jesse <[email protected]>
Signed-off-by: jesse <[email protected]>
Signed-off-by: jesse <[email protected]>
Signed-off-by: jesse <[email protected]>
What this PR does / why we need it?
This PR is based on vllm-project/vllm#24799 and aims to support speculative decoding with async scheduling.
It also fixes a crash in Eagle speculative decoding models when the input length exceeds the drafter model's maximum length but not the verifier's, based on vllm-project/vllm#24662.
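The crash fix amounts to bounding how many draft tokens are proposed when a sequence is near (or past) the drafter's context limit while still within the verifier's. A hedged sketch of that guard, with hypothetical names (the real helper in the PR may differ):

```python
# Hypothetical sketch of the drafter-length guard: propose no draft tokens
# when the sequence already fills the drafter's context, and otherwise cap
# the number of speculative tokens to the drafter's remaining room.
def num_draft_tokens(seq_len: int,
                     drafter_max_len: int,
                     num_spec_tokens: int) -> int:
    if seq_len >= drafter_max_len:
        # Input fits the verifier but not the drafter: fall back to
        # normal decoding instead of crashing the draft model.
        return 0
    # Otherwise, do not let proposals run past the drafter's window.
    return min(num_spec_tokens, drafter_max_len - seq_len)
```

With this guard, verification still runs on the full-length verifier model; only the drafting step is skipped or shortened.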
Does this PR introduce any user-facing change?
How was this patch tested?