Add MagicMTP(block verify) and Triton optimization #4443

chenaoxuan · 2025-11-26T00:37:31Z

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

vLLM version: v0.11.2
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

github-actions · 2025-11-26T00:37:42Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist

Code Review

This pull request introduces Triton-based optimizations for rejection sampling and adds a "block verify" method, also known as MagicMTP. The changes include new Triton kernels for both greedy and random sampling, along with PyTorch fallback implementations. My review has identified a critical issue where a non-existent Triton function is called, which will prevent the code from running. Additionally, I've found a performance inefficiency in one of the new kernels and a use of a bare except clause, which is considered poor practice. I have provided suggestions to address these points.

gemini-code-assist · 2025-11-26T00:39:39Z

vllm_ascend/sample/rejection_sampler.py

+                    other=0
+                )
+                recovered_id = tl.argmax(tmp_target_prob, axis=-1)
+                max_p = tl.get_element(tmp_target_prob, (recovered_id,))


The function tl.get_element does not appear to be a valid function in the triton.language API. This will likely cause a compilation error. To get the value corresponding to the argmax index from a block tensor, you may need to use a different approach. Using tl.reduce with a custom binary operator to find both max value and index simultaneously is a common pattern for this.
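The "find both max value and index simultaneously" pattern mentioned above can be sketched in plain Python. This is only an illustration of the combine function that such a reduction would need, not Triton code and not the PR's implementation; the names `combine_max` and `max_with_index` are hypothetical.

```python
from functools import reduce
from typing import Sequence, Tuple

Pair = Tuple[float, int]  # (value, index)


def combine_max(a: Pair, b: Pair) -> Pair:
    """Binary operator over (value, index) pairs.

    Keeps the pair with the larger value; on a tie it keeps the lower
    index. The operator is associative, which is what a parallel
    reduction requires of its combine function.
    """
    if b[0] > a[0] or (b[0] == a[0] and b[1] < a[1]):
        return b
    return a


def max_with_index(values: Sequence[float]) -> Pair:
    """Return (max value, argmax index) in a single reduction pass."""
    pairs = zip(values, range(len(values)))
    return reduce(combine_max, pairs)


probs = [0.1, 0.7, 0.2]
max_p, recovered_id = max_with_index(probs)
```

Carrying the index alongside the value removes the need for a second lookup by index after the argmax. An empty input raises `TypeError` here, since `reduce` has no initial value.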

gemini-code-assist · 2025-11-26T00:39:40Z

vllm_ascend/sample/rejection_sampler.py

+        rejected = False
+        for pos in range(num_draft_tokens):
+            if not rejected:
+                draft_token_id = tl.load(draft_token_ids_ptr + start_idx + pos)
+                target_argmax_id = tl.load(target_argmax_ptr + start_idx + pos)
+                tl.store(
+                    output_token_ids_ptr + req_idx * (max_spec_len + 1) + pos,
+                    target_argmax_id,
+                )
+                if draft_token_id != target_argmax_id:
+                    # Reject.
+                    rejected = True
+
+        if not rejected:
+            # If all tokens are accepted, append the bonus token.
+            bonus_token_id = tl.load(bonus_token_ids_ptr + req_idx)
+            tl.store(
+                output_token_ids_ptr + req_idx * (max_spec_len + 1) + num_draft_tokens,
+                bonus_token_id,
+            )


The current implementation of the loop continues to iterate even after a token has been rejected, which is inefficient. You can return from the function immediately after the first rejection to avoid unnecessary iterations. This simplifies the code by removing the rejected flag and the conditional block after the loop.

for pos in range(num_draft_tokens):
    draft_token_id = tl.load(draft_token_ids_ptr + start_idx + pos)
    target_argmax_id = tl.load(target_argmax_ptr + start_idx + pos)
    tl.store(
        output_token_ids_ptr + req_idx * (max_spec_len + 1) + pos,
        target_argmax_id,
    )
    if draft_token_id != target_argmax_id:
        # Reject and stop processing this request.
        return

# If the loop completes, all tokens were accepted.
# Append the bonus token.
bonus_token_id = tl.load(bonus_token_ids_ptr + req_idx)
tl.store(
    output_token_ids_ptr + req_idx * (max_spec_len + 1) + num_draft_tokens,
    bonus_token_id,
)
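For reference, the control flow of this greedy verification can be mirrored in plain Python for a single request. This is a minimal sketch for checking expected outputs; it makes no claim about whether an early `return` inside a loop is accepted by a given Triton compiler or backend. The function name `greedy_verify` and the `-1` placeholder for unwritten slots are assumptions for illustration, not taken from the PR.

```python
from typing import List

PLACEHOLDER = -1  # assumed fill value for output slots that are never written


def greedy_verify(
    draft_token_ids: List[int],
    target_argmax_ids: List[int],
    bonus_token_id: int,
    max_spec_len: int,
) -> List[int]:
    """Greedy rejection sampling for one request.

    Writes the target's argmax token at each position, stops at the first
    position where the draft token disagrees, and appends the bonus token
    only if every draft token was accepted.
    """
    output = [PLACEHOLDER] * (max_spec_len + 1)
    for pos, (draft, target) in enumerate(zip(draft_token_ids, target_argmax_ids)):
        output[pos] = target
        if draft != target:
            # Reject: later positions stay as placeholders.
            return output
    # All draft tokens accepted: append the bonus token.
    output[len(draft_token_ids)] = bonus_token_id
    return output
```

With drafts `[1, 2, 3]` and target argmax `[1, 5, 3]`, the second position is rejected, so the output keeps the corrected token `5` and nothing after it.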

gemini-code-assist · 2025-11-26T00:39:40Z

vllm_ascend/sample/rejection_sampler.py

+except:
+    TRITON_ASCEND_AVAILABLE = False


Using a bare except: is generally discouraged as it can catch unexpected exceptions like SystemExit or KeyboardInterrupt, making it harder to debug issues. It's better to catch specific exceptions. In this case, ImportError seems more appropriate if you only want to handle cases where triton is not installed.

except ImportError:
    TRITON_ASCEND_AVAILABLE = False
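A small sketch of the narrower guard, written as a reusable helper so its behaviour can be checked without Triton installed. The helper name `optional_import` and the flag name `TRITON_AVAILABLE` are hypothetical and not part of the PR.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a module by name, returning None if it is not installed.

    Only ImportError is caught (ModuleNotFoundError is a subclass), so
    KeyboardInterrupt, SystemExit, and genuine bugs raised while the
    module is being imported still propagate.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Availability flag in the style of the snippet under review.
TRITON_AVAILABLE = optional_import("triton") is not None
```

Catching only `ImportError` keeps the fallback path limited to the "not installed" case, which is the concern raised in the comment above.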

Signed-off-by: chenaoxuan <[email protected]>

MengqingCao · 2025-11-26T08:16:30Z

vllm_ascend/sample/rejection_sampler.py

                                              generate_uniform_probs)
 from vllm.v1.spec_decode.metadata import SpecDecodeMetadata

+try:


We can use HAS_TRITON here; please refer to https://github.com/vllm-project/vllm/blob/d9d342d214b8c13f71215318a6d9252cc4a5ca47/vllm/triton_utils/importing.py#L12

gemini-code-assist bot reviewed Nov 26, 2025

chenaoxuan force-pushed the magicmtp branch 6 times, most recently from befa9e5 to 74390bb · November 26, 2025 03:25

Add MagicMTP(block verify) and Triton optimization

d770fba

Signed-off-by: chenaoxuan <[email protected]>

chenaoxuan force-pushed the magicmtp branch from 74390bb to d770fba · November 26, 2025 06:17

MengqingCao reviewed Nov 26, 2025
