Description
Hi Moonshot AI team,
First, thank you for your excellent paper and for open-sourcing your work on the Muon optimizer. It's a fascinating contribution to the field.
I've been studying the paper and the Megatron-LM implementation in detail, and I had a small suggestion to improve the clarity of Algorithm 1 ("Distributed Muon") for future readers.
I was initially very confused by the use of the variable G and the term "gradient matrix" in the "DP Gather" step (lines 5-6). The algorithm begins by requiring "Full Gradients G," but the object gathered in line 6 is actually the gradient-updated momentum buffer (g'), not the raw gradient.
This was confusing for two reasons:
- It seems to reuse the variable G for two different things (the initial raw gradient vs. the final momentum buffer).
- In a ZeRO-1 context, the raw gradients are replicated, so the idea of "gathering" them seemed paradoxical.
My confusion was resolved when I realized the gathered object is an optimizer state (the momentum buffer), which is sharded under ZeRO-1 and therefore does need to be gathered.
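To make the distinction concrete, here is a minimal sketch of the flow as I now understand it. The function and variable names (`distributed_muon_step`, `dp_gather`, `M_full`) are my own, not the paper's, and I'm assuming a plain momentum update `M = beta * M + G` for illustration:

```python
import numpy as np

def distributed_muon_step(full_grad, momentum_shards, rank, world_size, beta=0.95):
    """Under ZeRO-1, every rank sees the full (replicated) gradient G,
    but each rank owns only a shard of the momentum buffer M."""
    n = full_grad.shape[0]
    shard = n // world_size
    lo, hi = rank * shard, (rank + 1) * shard
    # Each rank updates only its own momentum shard with its slice of G.
    momentum_shards[rank] = beta * momentum_shards[rank] + full_grad[lo:hi]
    return momentum_shards[rank]

def dp_gather(momentum_shards):
    """The 'DP Gather' step: what gets reassembled across ranks is the
    updated momentum buffer (call it M_full), NOT the raw gradient G,
    because only the momentum buffer is sharded under ZeRO-1."""
    return np.concatenate(momentum_shards)
```

So "gathering" makes sense precisely because the object being gathered is sharded optimizer state, while the raw gradient never needs to be gathered at all.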
Suggestion:
To improve clarity, perhaps the pseudocode could use a different variable (e.g., M_full or G_momentum) in line 6 to distinguish the gathered momentum buffer from the initial raw gradient G.
This is a minor terminological point, but I believe it would make the excellent algorithm even easier to understand for people trying to learn from your work.
Thanks again for the great research!