Description
Hi Moonshot AI team,
First, thank you for your excellent paper and for open-sourcing your work on the Muon optimizer. It's a fascinating contribution to the field.
I've been studying the paper and the Megatron-LM implementation in detail, and I had a small suggestion to improve the clarity of Algorithm 1 ("Distributed Muon") for future readers.
I was initially very confused by the use of the variable G and the term "gradient matrix" in the "DP Gather" step (lines 5-6). The algorithm begins by requiring "Full Gradients G," but the object gathered in line 6 is actually the gradient-updated momentum buffer (g'), not the raw gradient.
This was confusing for two reasons:
- It seems to reuse the variable G for two different things (the initial raw gradient vs. the final momentum buffer).
- In a ZeRO-1 context, the raw gradients are replicated, so the idea of "gathering" them seemed paradoxical.
My confusion was resolved when I realized the gathered object is an optimizer state (the momentum buffer), which is sharded under ZeRO-1 and therefore does need to be gathered.
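To make the distinction concrete, here is a minimal sketch of the flow as I now understand it. The function and variable names (`distributed_muon_step`, `dp_gather`, `M_full`) are my own, not the paper's, and I'm assuming a plain momentum update `M = beta * M + G` for illustration:

```python
import numpy as np

def distributed_muon_step(full_grad, momentum_shards, rank, world_size, beta=0.95):
    """Under ZeRO-1, every rank sees the full (replicated) gradient G,
    but each rank owns only a shard of the momentum buffer M."""
    n = full_grad.shape[0]
    shard = n // world_size
    lo, hi = rank * shard, (rank + 1) * shard
    # Each rank updates only its own momentum shard with its slice of G.
    momentum_shards[rank] = beta * momentum_shards[rank] + full_grad[lo:hi]
    return momentum_shards[rank]

def dp_gather(momentum_shards):
    """The 'DP Gather' step: what gets reassembled across ranks is the
    updated momentum buffer (call it M_full), NOT the raw gradient G,
    because only the momentum buffer is sharded under ZeRO-1."""
    return np.concatenate(momentum_shards)
```

So "gathering" makes sense precisely because the object being gathered is sharded optimizer state, while the raw gradient never needs to be gathered at all.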
Suggestion:
To improve clarity, perhaps the pseudocode could use a different variable (e.g., M_full or G_momentum) in line 6 to distinguish the gathered momentum buffer from the initial raw gradient G.
This is a minor terminological point, but I believe it would make the excellent algorithm even easier to understand for people trying to learn from your work.
Thanks again for the great research!