Do optimizers like Adam and Muon keep the parameters, gradients, and momentum? If all of these are stored in fp32 (three fp32 tensors per parameter), a 1T-parameter model would need about 12 TB of memory, yet the paper states this memory usage is negligible. Even sharding across DP=1024 still leaves roughly 11 GB per device, and combining that with PP=16 would require at least 16384 H800 GPUs?
The toy_train documentation states, "To efficiently orthogonalize each update, we use a Newton-Schulz iteration, which has the advantage that it can be stably run in bfloat16 on the GPU." Does this mean the momentum buffer is kept in bf16? If so, it would still require at least 9 GB of GPU memory, which does not match the paper's claim of "reducing its per-device memory footprint to a negligible level."
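For reference, here is a minimal back-of-the-envelope sketch of the estimates above. The assumptions are mine, not taken from the paper or the repo: a 1T-parameter model, and the parameter, gradient, and momentum buffers sharded evenly across the data-parallel group.

```python
# Rough per-device memory check for the numbers quoted in this issue.
# Assumptions (not from the paper/repo): 1T parameters; params, grads, and
# momentum are sharded evenly across the DP group.

GiB = 1024**3

def per_device_gib(n_params: float, bytes_per_param: float, dp: int) -> float:
    """Total state divided evenly across `dp` data-parallel ranks, in GiB."""
    return n_params * bytes_per_param / dp / GiB

n_params = 1e12    # assumed 1T-parameter model
dp, pp = 1024, 16  # parallelism degrees quoted in the question

# All three buffers (params, grads, momentum) in fp32: 3 * 4 = 12 bytes/param.
fp32_case = per_device_gib(n_params, 12, dp)
# fp32 params + fp32 grads + bf16 momentum: 4 + 4 + 2 = 10 bytes/param.
bf16_momentum_case = per_device_gib(n_params, 10, dp)

print(f"GPUs needed     : {dp * pp}")                            # 16384
print(f"fp32 everything : {fp32_case:.1f} GiB/device")           # ~10.9 GiB
print(f"bf16 momentum   : {bf16_momentum_case:.1f} GiB/device")  # ~9.1 GiB
```

Under these assumptions the per-device figures land near the 11 GB and 9 GB cited above, which is why the "negligible memory footprint" claim is unclear to me.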