Add ZeRO-3 leaf module support for Qwen MoE models #701
Summary
This PR adds explicit support for Qwen Mixture-of-Experts (MoE) models when running with DeepSpeed ZeRO-3.
When model.config.model_type == "qwen3_moe", the script sets Qwen3MoeSparseMoeBlock as a ZeRO-3 leaf module. This ensures that collective communication works correctly during training.
Changes
- Import deepspeed and Qwen3MoeSparseMoeBlock inside the Qwen MoE branch.
- Call deepspeed.utils.set_z3_leaf_modules() to register the MoE block (a minimal sketch follows the Notes section below).
Notes
- If deepspeed or the Qwen MoE block import fails, the script logs a warning and continues without modification.
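The registration amounts to a small guarded block. The sketch below is illustrative only: the helper name maybe_set_qwen3_moe_leaf_modules and its placement are assumptions, and the transformers import path for Qwen3MoeSparseMoeBlock assumes a release that ships the Qwen3 MoE modeling file.

```python
import logging

logger = logging.getLogger(__name__)


def maybe_set_qwen3_moe_leaf_modules(model) -> None:
    """Hypothetical helper: register Qwen3MoeSparseMoeBlock as a ZeRO-3 leaf module.

    Leaf modules tell DeepSpeed not to hook/partition parameters below this
    module boundary, which keeps collective ops aligned across ranks even
    though each token only activates a subset of experts.
    """
    if getattr(model.config, "model_type", None) != "qwen3_moe":
        return

    try:
        # Lazy imports so non-MoE / non-DeepSpeed runs are unaffected.
        from deepspeed.utils import set_z3_leaf_modules
        from transformers.models.qwen3_moe.modeling_qwen3_moe import Qwen3MoeSparseMoeBlock
    except ImportError as exc:
        # Fallback described in the Notes: warn and continue without the hint.
        logger.warning("Could not register ZeRO-3 leaf module for Qwen3 MoE: %s", exc)
        return

    set_z3_leaf_modules(model, [Qwen3MoeSparseMoeBlock])
    logger.info("Registered Qwen3MoeSparseMoeBlock as a ZeRO-3 leaf module.")
```

In practice the call needs to happen after the model object exists but before ZeRO-3 partitions it, so immediately after the model is loaded and before the trainer or DeepSpeed engine is constructed is the natural spot.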
Motivation
Without this fix, training Qwen MoE models under ZeRO-3 may hang or suffer from incorrect collective operations. This patch enables stable fine-tuning of Qwen MoE models within the open-r1 training pipeline.
Limitations / Future Work
Reference
A similar fix has been applied in the OpenRLHF repository:
OpenRLHF/OpenRLHF@d5fcb42#diff-da77c0ae1d958e6b8c491f9d6f1f8ad54ee9ab21c231d4b2490fb1c09af1046f