Suggestion: Clarify variable naming in Algorithm 1 for "Distributed Muon" #38

@JenWei0312

Hi Moonshot AI team,

First, thank you for your excellent paper and for open-sourcing your work on the Muon optimizer. It's a fascinating contribution to the field.

I've been studying the paper and the Megatron-LM implementation in detail, and I had a small suggestion to improve the clarity of Algorithm 1 ("Distributed Muon") for future readers.

I was initially very confused by the use of the variable G and the term "gradient matrix" in the "DP Gather" step (lines 5-6). The algorithm begins by requiring "Full Gradients G," but the object gathered in line 6 is actually the gradient-updated momentum buffer (g'), not the raw gradient.

This was confusing for two reasons:

  1. It seems to reuse the variable G for two different things (initial raw gradient vs. final momentum buffer).
  2. In a ZeRO-1 context, the raw gradients are replicated, so the idea of "gathering" them seemed paradoxical.

My confusion was resolved when I realized the gathered object is an optimizer state (the momentum buffer), which is sharded under ZeRO-1 and therefore does need to be gathered.

Suggestion:
To improve clarity, perhaps the pseudocode could use a different variable (e.g., M_full or G_momentum) in line 6 to distinguish the gathered momentum buffer from the initial raw gradient G.
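To make the suggestion concrete, here is a minimal single-process sketch of how I read steps 1-6 of Algorithm 1, with the gathered buffer renamed to `M_full`. Everything in it is an assumption for illustration: the helper names are hypothetical and are not Megatron-LM or `torch.distributed` APIs, the collectives are simulated with plain lists, parameters are flattened to 1-D, the reduction is a sum, and the momentum rule is a plain `m <- mu * m + g`, which may differ from the actual implementation.

```python
# Simulated data-parallel group: one list entry per rank.
# All function names below are hypothetical, for illustration only.

def reduce_scatter(G_per_rank, rank):
    """Sum the full gradients across ranks, then return the shard owned by `rank`."""
    world = len(G_per_rank)
    n = len(G_per_rank[0])
    shard = n // world  # assumes n is divisible by the world size
    summed = [sum(G[i] for G in G_per_rank) for i in range(n)]
    return summed[rank * shard:(rank + 1) * shard]

def update_with_momentum(g_local, m_local, mu):
    """Assumed momentum rule on the local shard: m <- mu * m + g."""
    return [mu * m + g for m, g in zip(m_local, g_local)]

def gather(shards):
    """Concatenate the per-rank shards back into one full buffer."""
    return [x for shard in shards for x in shard]

def steps_1_to_6(G_per_rank, m_shards, mu):
    world = len(G_per_rank)
    # Lines 1-2: each rank ends up with its local gradient shard g.
    g = [reduce_scatter(G_per_rank, r) for r in range(world)]
    # Lines 3-4: each rank updates its own sharded momentum, giving g'.
    g_prime = [update_with_momentum(g[r], m_shards[r], mu) for r in range(world)]
    # Lines 5-6: the gathered object is the momentum buffer, so it gets
    # its own name instead of reusing G.
    M_full = gather(g_prime)
    return g, M_full

# Two ranks, four parameters, momentum sharded two elements per rank.
G_per_rank = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
m_shards = [[1.0, 1.0], [2.0, 2.0]]
g, M_full = steps_1_to_6(G_per_rank, m_shards, mu=0.5)

reduced_gradient = gather(g)  # [1.5, 2.5, 3.5, 4.5]
print(M_full)                 # [2.0, 3.0, 4.5, 5.5]
```

In this toy run the gathered buffer differs from the reduced gradient, which is the distinction I think a separate name such as `M_full` would make visible in the pseudocode.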

This is a minor terminological point, but I believe it would make the excellent algorithm even easier to understand for people trying to learn from your work.

Thanks again for the great research!
