Skip to content

Enable MoE (Qwen3-30B-A3B) with Expert Parallelism support#87

Open
shivam-MBZUAI wants to merge 1 commit into
mainfrom
feat/moe
Open

Enable MoE (Qwen3-30B-A3B) with Expert Parallelism support#87
shivam-MBZUAI wants to merge 1 commit into
mainfrom
feat/moe

Conversation

@shivam-MBZUAI
Copy link
Copy Markdown
Contributor

@shivam-MBZUAI shivam-MBZUAI commented Apr 9, 2026

  • Switch model to Qwen3-30B-A3B (30.5B total, 3.3B active params)
  • Add Expert Parallelism (EP) detection and validation in env.py
  • Add train_ep.py reference implementation with EPMoeBlock, gradient sync, and full state dict gathering via broadcast
  • Fix FSDP reference run: transformer_auto_wrap_policy, disable no_sync to prevent full-gradient OOM, disable gradient checkpointing
  • Use SGD optimizer (lr=0.1) across all strategies for bf16 compatibility
  • MFU calculation uses active params (3.3B) for MoE models
  • Update all local train files for new model and optimizer

Made-with: Cursor

Summary by CodeRabbit

  • New Features

    • Added Expert-Parallel (EP) training support for Mixture-of-Experts models with dedicated training module and distributed expert sharding.
    • Separate tokenizer caching in Docker builds for improved artifact management.
  • Improvements

    • Updated default optimizer configuration across training scripts.
    • Enhanced gradient checkpointing with improved memory efficiency settings.
    • Optimized training loop with micro-batch gradient accumulation and MoE active parameter counting.

- Switch model to Qwen3-30B-A3B (30.5B total, 3.3B active params)
- Add Expert Parallelism (EP) detection and validation in env.py
- Add train_ep.py reference implementation with EPMoeBlock, gradient
  sync, and full state dict gathering via broadcast
- Fix FSDP reference run: transformer_auto_wrap_policy, disable
  no_sync to prevent full-gradient OOM, disable gradient checkpointing
- Use SGD optimizer (lr=0.1) across all strategies for bf16 compatibility
- MFU calculation uses active params (3.3B) for MoE models
- Update all local train files for new model and optimizer

Made-with: Cursor
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 9, 2026

Walkthrough

The PR introduces expert-parallel (EP) training support for Mixture-of-Experts (MoE) models by extending parallelism configuration with ep_size, implementing a new EP training module with custom all-reduce logic, switching the default optimizer from AdamW to SGD across all training scripts, and refactoring gradient checkpointing and loss computation pipelines.

Changes

Cohort / File(s) Summary
Parallelism Infrastructure
environments/templar/env.py, environments/templar/Dockerfile
Extended ParallelismConfig with ep_size field and updated parsing/validation. Added MoE-aware parameter counting (active params vs. total when experts present), rank-0-only initial state caching, and FSDP mixed-precision (bf16) with backward prefetch configuration. Updated Dockerfile to derive tokenizer name from hparams.json via benchmark_tokenizer_name, defaulting to model name.
Expert-Parallel Training
local_test/train_ep.py
New module implementing expert-parallel MoE training with _EPAllReduce autograd function for cross-EP communication, EPMoeBlock wrapping for expert sharding, gradient synchronization excluding expert parameters, and full state dict reconstruction via expert flattening/broadcasting.
Training Script Convergence
local_test/train.py, local_test/train_ddp.py, local_test/train_fsdp.py, local_test/train_mixed.py, local_test/train_tp.py
Consistent updates across all training scripts: added ep_size: 1 to get_strategy() return values, replaced AdamW optimizer (lr=1e-4, fused) with SGD (lr=0.1, momentum=0.0), and standardized gradient checkpointing kwargs (use_reentrant=False, preserve_rng_state=False). Removed chunked loss implementations in favor of direct torch.nn.functional.cross_entropy. train.py additionally removes selective activation checkpointing and torch.compile usage.
Documentation & Testing
local_test/simulate_validator.py
Updated Docker run instructions (removed -it flags) and adjusted strategy examples to reflect Qwen3-30B-A3B MoE configurations including ep_size. Updated mount path references from train.py to train_ep.py for single-GPU scenario.

Sequence Diagram

sequenceDiagram
    participant Trainer
    participant Model as Model (Sharded<br/>Experts)
    participant RouterLogits as Router & Top-K
    participant LocalExperts as Local Expert<br/>Processors
    participant EPAllReduce as EP AllReduce<br/>Group
    participant Optimizer

    Trainer->>Model: Forward Pass
    Model->>RouterLogits: Compute router logits
    RouterLogits->>RouterLogits: Compute routing weights<br/>(softmax + top-k)
    RouterLogits->>LocalExperts: Select top-k experts<br/>& token indices
    LocalExperts->>LocalExperts: Execute only local<br/>expert shards
    LocalExperts->>EPAllReduce: Sum expert outputs<br/>across EP ranks
    EPAllReduce->>Model: Aggregated expert output
    Model->>Trainer: Logits + Router Logits
    
    Trainer->>Trainer: Compute next-token<br/>cross-entropy loss
    Trainer->>Trainer: Backward propagation
    Trainer->>EPAllReduce: All-reduce replicated<br/>gradients (excl. experts)
    EPAllReduce->>Optimizer: Synchronized gradients
    Optimizer->>Model: SGD update with<br/>lr=0.1, momentum=0.0
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • crusades#49: Extends ParallelismConfig with ep_size and updates get_strategy() dict parsing in environments/templar/env.py, directly overlapping with parallelism topology changes.
  • crusades#76: Modifies environments/templar/Dockerfile tokenizer handling to read benchmark_tokenizer_name from hparams, matching the same tokenizer download refactoring.
  • crusades#30: Updates multi-GPU evaluation wiring and environment configuration in environments/templar/env.py for DDP/FSDP/TP support, complementary to the new EP infrastructure additions.

Suggested reviewers

  • joellidin

🐰 Expert experts sharding their skills,
SGD optimizes with steady will,
Tokens flow through parallel lanes,
MoE routers route through neural trains,
One step closer to the finest scales! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.95% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: enabling MoE support with Expert Parallelism for Qwen3-30B-A3B model, which is the primary objective of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/moe

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
local_test/train_ep.py (2)

171-171: Optional: Rename unused loop variable step to _step.

Per Python conventions, prefix unused variables with underscore.

Suggested fix
-    for step in range(num_steps):
+    for _step in range(num_steps):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@local_test/train_ep.py` at line 171, The loop variable `step` in the training
loop "for step in range(num_steps):" is unused; rename it to `_step` to follow
Python conventions for unused variables and silence linters — update the loop
header in train_ep.py accordingly (e.g., change `for step in range(num_steps):`
to use `_step`) and ensure no other references to `step` exist in the
surrounding function.

123-123: Add strict=True to zip() for safety.

The expert_items and shapes lists should always have the same length since they're derived from the same source, but adding strict=True would catch any unexpected mismatch.

Suggested fix
-                for (name, _), shape in zip(expert_items, shapes):
+                for (name, _), shape in zip(expert_items, shapes, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@local_test/train_ep.py` at line 123, The loop using zip(expert_items, shapes)
should enforce length equality: update the zip call in the for loop that
iterates "for (name, _), shape in zip(expert_items, shapes):" to use zip(...,
strict=True) so any mismatch between expert_items and shapes raises an immediate
error; modify that specific zip invocation accordingly.
local_test/train.py (1)

99-99: Optional: Rename unused loop variable step to _step.

The loop variable is unused within the loop body. Renaming to _step follows Python conventions for intentionally unused variables and silences the linter warning.

Suggested fix
-    for step in range(num_steps):
+    for _step in range(num_steps):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@local_test/train.py` at line 99, The loop variable "step" in the for loop
"for step in range(num_steps):" is unused and should be renamed to "_step" to
follow Python conventions and silence linters; update the loop header in
train.py to use "for _step in range(num_steps):" and ensure no other references
to "step" exist elsewhere in that scope (adjust if necessary).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@local_test/train_ep.py`:
- Line 171: The loop variable `step` in the training loop "for step in
range(num_steps):" is unused; rename it to `_step` to follow Python conventions
for unused variables and silence linters — update the loop header in train_ep.py
accordingly (e.g., change `for step in range(num_steps):` to use `_step`) and
ensure no other references to `step` exist in the surrounding function.
- Line 123: The loop using zip(expert_items, shapes) should enforce length
equality: update the zip call in the for loop that iterates "for (name, _),
shape in zip(expert_items, shapes):" to use zip(..., strict=True) so any
mismatch between expert_items and shapes raises an immediate error; modify that
specific zip invocation accordingly.

In `@local_test/train.py`:
- Line 99: The loop variable "step" in the for loop "for step in
range(num_steps):" is unused and should be renamed to "_step" to follow Python
conventions and silence linters; update the loop header in train.py to use "for
_step in range(num_steps):" and ensure no other references to "step" exist
elsewhere in that scope (adjust if necessary).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1c468eea-635c-4851-944c-50b656f788c8

📥 Commits

Reviewing files that changed from the base of the PR and between 5b9a0c2 and a781afc.

📒 Files selected for processing (9)
  • environments/templar/Dockerfile
  • environments/templar/env.py
  • local_test/simulate_validator.py
  • local_test/train.py
  • local_test/train_ddp.py
  • local_test/train_ep.py
  • local_test/train_fsdp.py
  • local_test/train_mixed.py
  • local_test/train_tp.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant