
fix(inference): RoutedExpertsCapturer for DeepEP MK path + cudagraph capture #2440

Draft
samsja wants to merge 1 commit into main from fix/routed-experts-capturer-mk-path

Conversation

samsja (Member) commented May 8, 2026


vLLM 0.19's `RoutedExpertsCapturer.capture()` hardcodes `assert cumsum[-1] == topk_ids.shape[0]`, which only holds when `DefaultMoERunner` is on its naive dispatch+combine path (DP combine concats all ranks' tokens before `select_experts`). With DEEPGEMM Fp8 MoE + `deepep_high_throughput`, `supports_internal_mk=True` and the modular-kernel path runs instead: DP combine happens inside `quant_method.apply`, so `select_experts` sees only this rank's tokens. Every DP worker trips the assert during CUDA-graph warmup and the engine cores die.
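For illustration, a minimal sketch of the invariant that breaks (tensor names, DP size, and top-k are made up for the example, not vLLM's actual internals):

```python
import torch

# dp_rank_token_counts: tokens per DP rank, as dp_metadata would report them.
dp_rank_token_counts = torch.tensor([4, 4, 4, 4])   # hypothetical DP=4
cumsum = torch.cumsum(dp_rank_token_counts, dim=0)  # tensor([ 4,  8, 12, 16])

# Naive dispatch+combine path: select_experts ran on the cross-DP
# concatenation, so topk_ids covers all 16 tokens and the assert holds.
topk_ids_all_ranks = torch.zeros(16, 8, dtype=torch.long)
assert cumsum[-1] == topk_ids_all_ranks.shape[0]  # passes

# MK path: DP combine happens inside quant_method.apply, so
# select_experts only ever saw this rank's 4 tokens.
topk_ids_this_rank = torch.zeros(4, 8, dtype=torch.long)
print(cumsum[-1].item() == topk_ids_this_rank.shape[0])  # False -> assert trips
```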

Patch `capture()` to mirror the post-refactor behavior on vLLM main (PR vllm-project/vllm#39917): only slice when `topk_ids` is the cross-DP concatenation; otherwise copy verbatim. This handles both the MK path and the cudagraph-capture warmup case (where `dp_metadata` still claims `max_num_batched_tokens` per rank but the captured batch is one of the `cudagraph_capture_sizes`), both of which would trip the strict either/or check from the earlier upstream fix (vllm-project/vllm#37879).
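A sketch of the patched branch logic (the argument names and buffer-copy details below are assumptions for illustration, not vLLM's exact signature):

```python
import torch

def capture(buffer: torch.Tensor, topk_ids: torch.Tensor,
            cumsum: torch.Tensor, dp_rank: int) -> None:
    """Sketch of the fix: slice only when topk_ids spans all DP ranks."""
    total_tokens = int(cumsum[-1])
    if topk_ids.shape[0] == total_tokens:
        # Naive dispatch+combine path: topk_ids is the cross-DP
        # concatenation, so carve out this rank's slice.
        start = int(cumsum[dp_rank - 1]) if dp_rank > 0 else 0
        end = int(cumsum[dp_rank])
        local_ids = topk_ids[start:end]
    else:
        # MK path or cudagraph-capture warmup: topk_ids already holds
        # only this rank's tokens (possibly a cudagraph capture size
        # smaller than what dp_metadata claims), so copy verbatim.
        local_ids = topk_ids
    buffer[: local_ids.shape[0]].copy_(local_ids)
```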

Verified on a 12-node GLM-5.1-FP8 RL run (DP=16, EP, `deepep_high_throughput`): inference comes up cleanly through warmup and into rollout generation.


Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
