Do optimizers like Adam and Muon keep the parameters, gradients, and momentum? If all of these are stored in fp32 (three fp32 tensors per parameter), a 1T-parameter model would need about 12 TB of memory, yet the paper states this memory usage is negligible. Even sharding across DP=1024 still leaves roughly 11 GB per device, and combining that with PP=16 would require at least 16384 H800 GPUs?
The toy_train documentation states, "To efficiently orthogonalize each update, we use a Newton-Schulz iteration, which has the advantage that it can be stably run in bfloat16 on the GPU." Does this mean the momentum buffer is kept in bf16? If so, it would still require at least 9 GB of GPU memory, which does not match the paper's claim of "reducing its per-device memory footprint to a negligible level."
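For reference, here is a minimal back-of-the-envelope sketch of the estimates above. The assumptions are mine, not taken from the paper or the repo: a 1T-parameter model, and the parameter, gradient, and momentum buffers sharded evenly across the data-parallel group.

```python
# Rough per-device memory check for the numbers quoted in this issue.
# Assumptions (not from the paper/repo): 1T parameters; params, grads, and
# momentum are sharded evenly across the DP group.

GiB = 1024**3

def per_device_gib(n_params: float, bytes_per_param: float, dp: int) -> float:
    """Total state divided evenly across `dp` data-parallel ranks, in GiB."""
    return n_params * bytes_per_param / dp / GiB

n_params = 1e12    # assumed 1T-parameter model
dp, pp = 1024, 16  # parallelism degrees quoted in the question

# All three buffers (params, grads, momentum) in fp32: 3 * 4 = 12 bytes/param.
fp32_case = per_device_gib(n_params, 12, dp)
# fp32 params + fp32 grads + bf16 momentum: 4 + 4 + 2 = 10 bytes/param.
bf16_momentum_case = per_device_gib(n_params, 10, dp)

print(f"GPUs needed     : {dp * pp}")                            # 16384
print(f"fp32 everything : {fp32_case:.1f} GiB/device")           # ~10.9 GiB
print(f"bf16 momentum   : {bf16_momentum_case:.1f} GiB/device")  # ~9.1 GiB
```

Under these assumptions the per-device figures land near the 11 GB and 9 GB cited above, which is why the "negligible memory footprint" claim is unclear to me.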