Enable MoE (Qwen3-30B-A3B) with Expert Parallelism support by shivam-MBZUAI · Pull Request #87 · one-covenant/crusades

shivam-MBZUAI · 2026-04-09T08:51:34Z

Switch model to Qwen3-30B-A3B (30.5B total, 3.3B active params)
Add Expert Parallelism (EP) detection and validation in env.py
Add train_ep.py reference implementation with EPMoeBlock, gradient sync, and full state dict gathering via broadcast
Fix FSDP reference run: transformer_auto_wrap_policy, disable no_sync to prevent full-gradient OOM, disable gradient checkpointing
Use SGD optimizer (lr=0.1) across all strategies for bf16 compatibility
MFU calculation uses active params (3.3B) for MoE models
Update all local train files for new model and optimizer

Made-with: Cursor

Summary by CodeRabbit

New Features
- Added Expert-Parallel (EP) training support for Mixture-of-Experts models with dedicated training module and distributed expert sharding.
- Separate tokenizer caching in Docker builds for improved artifact management.
Improvements
- Updated default optimizer configuration across training scripts.
- Enhanced gradient checkpointing with improved memory efficiency settings.
- Optimized training loop with micro-batch gradient accumulation and MoE active parameter counting.

- Switch model to Qwen3-30B-A3B (30.5B total, 3.3B active params) - Add Expert Parallelism (EP) detection and validation in env.py - Add train_ep.py reference implementation with EPMoeBlock, gradient sync, and full state dict gathering via broadcast - Fix FSDP reference run: transformer_auto_wrap_policy, disable no_sync to prevent full-gradient OOM, disable gradient checkpointing - Use SGD optimizer (lr=0.1) across all strategies for bf16 compatibility - MFU calculation uses active params (3.3B) for MoE models - Update all local train files for new model and optimizer Made-with: Cursor

coderabbitai · 2026-04-09T08:51:48Z

Walkthrough

The PR introduces expert-parallel (EP) training support for Mixture-of-Experts (MoE) models by extending parallelism configuration with ep_size, implementing a new EP training module with custom all-reduce logic, switching the default optimizer from AdamW to SGD across all training scripts, and refactoring gradient checkpointing and loss computation pipelines.

Changes

Cohort / File(s)	Summary
Parallelism Infrastructure `environments/templar/env.py`, `environments/templar/Dockerfile`	Extended `ParallelismConfig` with `ep_size` field and updated parsing/validation. Added MoE-aware parameter counting (active params vs. total when experts present), rank-0-only initial state caching, and FSDP mixed-precision (bf16) with backward prefetch configuration. Updated Dockerfile to derive tokenizer name from `hparams.json` via `benchmark_tokenizer_name`, defaulting to model name.
Expert-Parallel Training `local_test/train_ep.py`	New module implementing expert-parallel MoE training with `_EPAllReduce` autograd function for cross-EP communication, `EPMoeBlock` wrapping for expert sharding, gradient synchronization excluding expert parameters, and full state dict reconstruction via expert flattening/broadcasting.
Training Script Convergence `local_test/train.py`, `local_test/train_ddp.py`, `local_test/train_fsdp.py`, `local_test/train_mixed.py`, `local_test/train_tp.py`	Consistent updates across all training scripts: added `ep_size: 1` to `get_strategy()` return values, replaced AdamW optimizer (lr=1e-4, fused) with SGD (lr=0.1, momentum=0.0), and standardized gradient checkpointing kwargs (`use_reentrant=False`, `preserve_rng_state=False`). Removed chunked loss implementations in favor of direct `torch.nn.functional.cross_entropy`. `train.py` additionally removes selective activation checkpointing and torch.compile usage.
Documentation & Testing `local_test/simulate_validator.py`	Updated Docker run instructions (removed `-it` flags) and adjusted strategy examples to reflect Qwen3-30B-A3B MoE configurations including `ep_size`. Updated mount path references from `train.py` to `train_ep.py` for single-GPU scenario.

Sequence Diagram

sequenceDiagram
    participant Trainer
    participant Model as Model (Sharded<br/>Experts)
    participant RouterLogits as Router & Top-K
    participant LocalExperts as Local Expert<br/>Processors
    participant EPAllReduce as EP AllReduce<br/>Group
    participant Optimizer

    Trainer->>Model: Forward Pass
    Model->>RouterLogits: Compute router logits
    RouterLogits->>RouterLogits: Compute routing weights<br/>(softmax + top-k)
    RouterLogits->>LocalExperts: Select top-k experts<br/>& token indices
    LocalExperts->>LocalExperts: Execute only local<br/>expert shards
    LocalExperts->>EPAllReduce: Sum expert outputs<br/>across EP ranks
    EPAllReduce->>Model: Aggregated expert output
    Model->>Trainer: Logits + Router Logits
    
    Trainer->>Trainer: Compute next-token<br/>cross-entropy loss
    Trainer->>Trainer: Backward propagation
    Trainer->>EPAllReduce: All-reduce replicated<br/>gradients (excl. experts)
    EPAllReduce->>Optimizer: Synchronized gradients
    Optimizer->>Model: SGD update with<br/>lr=0.1, momentum=0.0

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

crusades#49: Extends ParallelismConfig with ep_size and updates get_strategy() dict parsing in environments/templar/env.py, directly overlapping with parallelism topology changes.
crusades#76: Modifies environments/templar/Dockerfile tokenizer handling to read benchmark_tokenizer_name from hparams, matching the same tokenizer download refactoring.
crusades#30: Updates multi-GPU evaluation wiring and environment configuration in environments/templar/env.py for DDP/FSDP/TP support, complementary to the new EP infrastructure additions.

Suggested reviewers

joellidin

🐰 Expert experts sharding their skills,
SGD optimizes with steady will,
Tokens flow through parallel lanes,
MoE routers route through neural trains,
One step closer to the finest scales! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 21.95% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: enabling MoE support with Expert Parallelism for Qwen3-30B-A3B model, which is the primary objective of the PR.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/moe

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

local_test/train_ep.py (2)

171-171: Optional: Rename unused loop variable step to _step.

Per Python conventions, prefix unused variables with underscore.

Suggested fix

-    for step in range(num_steps):
+    for _step in range(num_steps):

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@local_test/train_ep.py` at line 171, The loop variable `step` in the training
loop "for step in range(num_steps):" is unused; rename it to `_step` to follow
Python conventions for unused variables and silence linters — update the loop
header in train_ep.py accordingly (e.g., change `for step in range(num_steps):`
to use `_step`) and ensure no other references to `step` exist in the
surrounding function.

123-123: Add strict=True to zip() for safety.

The expert_items and shapes lists should always have the same length since they're derived from the same source, but adding strict=True would catch any unexpected mismatch.

Suggested fix

-                for (name, _), shape in zip(expert_items, shapes):
+                for (name, _), shape in zip(expert_items, shapes, strict=True):

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@local_test/train_ep.py` at line 123, The loop using zip(expert_items, shapes)
should enforce length equality: update the zip call in the for loop that
iterates "for (name, _), shape in zip(expert_items, shapes):" to use zip(...,
strict=True) so any mismatch between expert_items and shapes raises an immediate
error; modify that specific zip invocation accordingly.

local_test/train.py (1)

99-99: Optional: Rename unused loop variable step to _step.

The loop variable is unused within the loop body. Renaming to _step follows Python conventions for intentionally unused variables and silences the linter warning.
Suggested fix
-    for step in range(num_steps):
+    for _step in range(num_steps):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@local_test/train.py` at line 99, The loop variable "step" in the for loop
"for step in range(num_steps):" is unused and should be renamed to "_step" to
follow Python conventions and silence linters; update the loop header in
train.py to use "for _step in range(num_steps):" and ensure no other references
to "step" exist elsewhere in that scope (adjust if necessary).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@local_test/train_ep.py`:
- Line 171: The loop variable `step` in the training loop "for step in
range(num_steps):" is unused; rename it to `_step` to follow Python conventions
for unused variables and silence linters — update the loop header in train_ep.py
accordingly (e.g., change `for step in range(num_steps):` to use `_step`) and
ensure no other references to `step` exist in the surrounding function.
- Line 123: The loop using zip(expert_items, shapes) should enforce length
equality: update the zip call in the for loop that iterates "for (name, _),
shape in zip(expert_items, shapes):" to use zip(..., strict=True) so any
mismatch between expert_items and shapes raises an immediate error; modify that
specific zip invocation accordingly.

In `@local_test/train.py`:
- Line 99: The loop variable "step" in the for loop "for step in
range(num_steps):" is unused and should be renamed to "_step" to follow Python
conventions and silence linters; update the loop header in train.py to use "for
_step in range(num_steps):" and ensure no other references to "step" exist
elsewhere in that scope (adjust if necessary).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1c468eea-635c-4851-944c-50b656f788c8

📥 Commits

Reviewing files that changed from the base of the PR and between 5b9a0c2 and a781afc.

📒 Files selected for processing (9)

environments/templar/Dockerfile
environments/templar/env.py
local_test/simulate_validator.py
local_test/train.py
local_test/train_ddp.py
local_test/train_ep.py
local_test/train_fsdp.py
local_test/train_mixed.py
local_test/train_tp.py

coderabbitai Bot reviewed Apr 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable MoE (Qwen3-30B-A3B) with Expert Parallelism support#87

Enable MoE (Qwen3-30B-A3B) with Expert Parallelism support#87
shivam-MBZUAI wants to merge 1 commit into
mainfrom
feat/moe

shivam-MBZUAI commented Apr 9, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 9, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shivam-MBZUAI commented Apr 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shivam-MBZUAI commented Apr 9, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 9, 2026 •

edited

Loading