what is the dtype of muon? #41

@lk137095576

Description

Do optimizers like Adam and Muon keep parameters, gradients, and momentum? If all of these are stored in fp32 (three copies the size of the parameters), a 1T-parameter model would need roughly 12TB of memory for this state, yet the paper says the memory usage is negligible. Even sharded over DP=1024 that is still about 11GB per device, and with PP=16 on top of that it would take at least 16384 H800 GPUs.
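For reference, this is the back-of-the-envelope arithmetic I am using. It is only a rough sketch under my own assumptions (1T parameters, fp32 parameters/gradients/momentum, and ZeRO-style sharding of the state over the data-parallel group); the function and variable names are mine, not from this repo:

```python
# Rough memory estimate for optimizer-related state; my own assumptions, not code from this repo.
def state_memory_gb(num_params: float, bytes_per_elem: int = 4, num_copies: int = 3,
                    dp_shards: int = 1) -> float:
    """Bytes for parameters + gradients + momentum, optionally sharded over DP ranks, in GB."""
    total_bytes = num_params * bytes_per_elem * num_copies
    return total_bytes / dp_shards / 1e9

num_params = 1e12  # assuming "1T model" means ~1 trillion parameters

print(state_memory_gb(num_params))                  # ~12000 GB (12 TB) unsharded
print(state_memory_gb(num_params, dp_shards=1024))  # ~11.7 GB per device with DP=1024
```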

The toy_train documentation states, "To efficiently orthogonalize each update, we use a Newton-Schulz iteration, which has the advantage that it can be stably run in bfloat16 on the GPU." Does this mean the momentum buffer is stored in bf16? Even then, it would still require at least 9GB of GPU memory per device, which doesn't seem consistent with the paper's claim of "reducing its per-device memory footprint to a negligible level."
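To make sure I understand the bf16 claim, here is a minimal sketch of the Newton-Schulz orthogonalization run entirely in bfloat16, following the public Muon reference implementation; the coefficients and step count are assumptions on my part and may differ from what toy_train actually uses:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton-Schulz iteration in bfloat16; coefficients follow the public Muon
    # reference implementation and are illustrative, not necessarily toy_train's exact values.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + eps)            # normalize so the spectral norm is at most ~1
    transposed = X.size(0) > X.size(1)  # iterate on the wider orientation
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X
```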
