
fix(inference): RoutedExpertsCapturer for DeepEP MK path + cudagraph capture #2440

Draft
samsja wants to merge 1 commit into main from fix/routed-experts-capturer-mk-path

Conversation

samsja (Member) commented May 8, 2026


vLLM 0.19's `RoutedExpertsCapturer.capture()` hardcodes `assert cumsum[-1] == topk_ids.shape[0]`, which only holds when `DefaultMoERunner` is on its naive dispatch+combine path (DP combine concats all ranks' tokens before `select_experts`). With DEEPGEMM Fp8 MoE + `deepep_high_throughput`, `supports_internal_mk=True` and the modular-kernel path runs instead: DP combine happens inside `quant_method.apply`, so `select_experts` sees only this rank's tokens. Every DP worker trips the assert during CUDA-graph warmup and the engine cores die.
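For illustration, a minimal sketch of the invariant that breaks (tensor names, DP size, and top-k are made up for the example, not vLLM's actual internals):

```python
import torch

# dp_rank_token_counts: tokens per DP rank, as dp_metadata would report them.
dp_rank_token_counts = torch.tensor([4, 4, 4, 4])   # hypothetical DP=4
cumsum = torch.cumsum(dp_rank_token_counts, dim=0)  # tensor([ 4,  8, 12, 16])

# Naive dispatch+combine path: select_experts ran on the cross-DP
# concatenation, so topk_ids covers all 16 tokens and the assert holds.
topk_ids_all_ranks = torch.zeros(16, 8, dtype=torch.long)
assert cumsum[-1] == topk_ids_all_ranks.shape[0]  # passes

# MK path: DP combine happens inside quant_method.apply, so
# select_experts only ever saw this rank's 4 tokens.
topk_ids_this_rank = torch.zeros(4, 8, dtype=torch.long)
print(cumsum[-1].item() == topk_ids_this_rank.shape[0])  # False -> assert trips
```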

Patch `capture()` to mirror the post-refactor behavior on vLLM main (PR vllm-project/vllm#39917): only slice when `topk_ids` is the cross-DP concatenation; otherwise copy verbatim. This handles both the MK path and the cudagraph-capture warmup case (where `dp_metadata` still claims `max_num_batched_tokens` per rank but the captured batch is one of the `cudagraph_capture_sizes`), both of which would trip the strict either/or check from the earlier upstream fix (vllm-project/vllm#37879).
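A sketch of the patched branch logic (the argument names and buffer-copy details below are assumptions for illustration, not vLLM's exact signature):

```python
import torch

def capture(buffer: torch.Tensor, topk_ids: torch.Tensor,
            cumsum: torch.Tensor, dp_rank: int) -> None:
    """Sketch of the fix: slice only when topk_ids spans all DP ranks."""
    total_tokens = int(cumsum[-1])
    if topk_ids.shape[0] == total_tokens:
        # Naive dispatch+combine path: topk_ids is the cross-DP
        # concatenation, so carve out this rank's slice.
        start = int(cumsum[dp_rank - 1]) if dp_rank > 0 else 0
        end = int(cumsum[dp_rank])
        local_ids = topk_ids[start:end]
    else:
        # MK path or cudagraph-capture warmup: topk_ids already holds
        # only this rank's tokens (possibly a cudagraph capture size
        # smaller than what dp_metadata claims), so copy verbatim.
        local_ids = topk_ids
    buffer[: local_ids.shape[0]].copy_(local_ids)
```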

Verified on a 12-node GLM-5.1-FP8 RL run (DP=16, EP, `deepep_high_throughput`): inference comes up cleanly through warmup and into rollout generation.


Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
