add deepep and test, modify test_moe_quant and test_attention_quant#349
qiushi13 wants to merge 13 commits into
Claude Code Review
Verdict: Request changes -- Combine path has a correctness bug in the per-rank slice of the global scatter and a likely-incorrect quant rounding rule.
Summary
Adds a torch reference implementation of DeepEP-style MoE dispatch/combine plus accuracy tests, and extends the quant MoE test to a distributed xops-vs-torch comparison. The torch backend is intended as a reference for the xops kernel.
Must fix
Suggestions (5)
Nits (3)
Notes
Code Review
This pull request introduces DeepEP-style cross-rank MoE dispatch and combine operators (MojoDeepEPDispatch and MojoDeepEPCombine), along with corresponding accuracy and distributed tests. It also updates the MoE operator to pass top_k to MojoQuantExperts and expands the test coverage for quantized MoE and attention operators. Feedback highlights a potential division-by-zero issue in the dispatch operator when calculating expand_scale and suggests avoiding monkey-patching _dispatch_up_proj_inv_smooth_scale in the tests.
    expand_scale = smoothed.abs().amax(-1, keepdim=True) / 127.0
    x = smoothed / expand_scale
There's a potential division-by-zero issue here. If a row in smoothed is all zeros, expand_scale will be zero and the division becomes 0.0 / 0.0, which evaluates to NaN. The NaN can then propagate through the model and cause correctness issues.
You should use a safe division pattern to handle this case. torch.nan_to_num is a concise option: it converts any NaN produced by 0/0 to 0, which is the correct quantized value for an all-zero row.
Suggested change:
    - expand_scale = smoothed.abs().amax(-1, keepdim=True) / 127.0
    - x = smoothed / expand_scale
    + expand_scale = smoothed.abs().amax(-1, keepdim=True) / 127.0
    + x = torch.nan_to_num(smoothed / expand_scale)
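The zero-row case described above can be illustrated without tensors. The following is a minimal sketch in plain Python, with lists standing in for the rows of smoothed; the function name quantize_rows_int8 is hypothetical and not part of the PR. It guards the scale explicitly instead of cleaning up afterwards, which also avoids one side effect of torch.nan_to_num: that call replaces every NaN, including NaNs that were already present in the input for unrelated reasons.

```python
def quantize_rows_int8(rows):
    """Per-row symmetric int8 quantization with a guard for all-zero rows.

    Illustrative only: plain lists stand in for tensor rows, and the
    rounding rule (Python's round, i.e. round-half-to-even) is an
    assumption; the PR's actual rounding rule is not shown in this excerpt.
    """
    quantized, scales = [], []
    for row in rows:
        # Same scale rule as the reviewed snippet: max |value| / 127.
        scale = max(abs(v) for v in row) / 127.0
        if scale == 0.0:
            # All-zero row: skip the 0/0 division and emit zeros directly.
            quantized.append([0 for _ in row])
        else:
            # Clamp to the symmetric int8 range after rounding.
            quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


q, s = quantize_rows_int8([[0.0, 0.0, 0.0], [0.5, -2.0, 1.5]])
```

In tensor code the same idea is often written by clamping the scale to a small positive minimum before dividing, so that all-zero rows quantize to zero without producing NaN in the first place. Whether that or nan_to_num fits better here depends on how the xops kernel treats zero rows, which this excerpt does not show.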
    ).to(device)
    op.load_state_dict(state_dict)
    if is_xops_backend:
        op._dispatch_up_proj_inv_smooth_scale = op.experts.up_proj_quantize.inv_smooth_scale
You are monkey-patching the _dispatch_up_proj_inv_smooth_scale attribute onto the op instance. This attribute is not declared in the MojoQuantMoE class __init__ method, which can make the code harder to understand and maintain. It suggests a hidden dependency for the xops backend.
While this might be a necessary workaround for testing, consider a cleaner approach for passing this data to the operator. For example, you could pass it as a keyword argument to the __init__ method if the backend implementation supports it, or add a dedicated setter method on the operator. This would make the data flow more explicit.
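As a rough illustration of the setter-based alternative suggested above, here is a minimal sketch. QuantMoEStub, set_dispatch_inv_smooth_scale, and dispatch_scale are hypothetical names invented for this example; they are not the real MojoQuantMoE API, and the real operator may need a different shape.

```python
class QuantMoEStub:
    """Hypothetical stand-in for an operator; not the real MojoQuantMoE."""

    def __init__(self, dispatch_inv_smooth_scale=None):
        # Declared in __init__ so the attribute is visible to readers and
        # tools, instead of being attached to the instance later by a test.
        self._dispatch_inv_smooth_scale = dispatch_inv_smooth_scale

    def set_dispatch_inv_smooth_scale(self, scale):
        # Explicit entry point for backends that need the scale at dispatch.
        if scale is None:
            raise ValueError("scale must not be None")
        self._dispatch_inv_smooth_scale = scale

    def dispatch_scale(self):
        # Fail loudly if the dependency was never provided.
        if self._dispatch_inv_smooth_scale is None:
            raise RuntimeError("dispatch scale was never set")
        return self._dispatch_inv_smooth_scale


op = QuantMoEStub()
op.set_dispatch_inv_smooth_scale([0.5, 0.25])
```

With this shape, the test would call the setter inside the is_xops_backend branch rather than assigning a private attribute, and a forgotten call surfaces as a clear error instead of an AttributeError deep in the backend.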
Claude Code Review
Verdict: Request changes -- Torch reference for DeepEP combine has a global-scatter indexing bug that will break correctness for
Summary
Adds
Must fix
Suggestions (5)
Nits (3)
Notes
Add operator definitions for MojoFusedAttnGateConcat, MojoGatherRopeStore, MojoPagedAttentionStoreKvCache, MojoPagedCacheDequant, and MojoRotaryEmbedding
feat: add fused ag scale quant and qk rmsnorm
No description provided.