Custom (MCORE) FSDP interface #12391
base: main
Changes from 4 commits
Commits: 6aaf614 · da27acd · 806a7a3 · 36dcfa9 · 2ceb6c9 · ea1cb0e · eeef832
@@ -689,6 +689,12 @@ def init_ddp(self):
            ) # We need to do this explicitly since this is a attr pytorch uses
            model_chunk.__class__.__getattr__ = getattr_proxy # type: ignore

            # Ensure that if using custom FSDP, gradient_accumulation_fusion is disabled on the model config.
            if self.ddp_config.use_custom_fsdp:
                assert (
                    module.config.gradient_accumulation_fusion == False
                ), "gradient_accumulation_fusion must be False when using custom FSDP"

            # param_sync_func is set in nemo.lightning.pytorch.optim.megatron
            no_sync_func, grad_sync_func = extract_ddp_funcs(self.ddp_config, self)
            for module in self:

Review thread on the added comment line:

Reviewer: I think this should be changed to this. Also,

Author: I will modify the text accordingly. For Torch FSDP2, I agree, but we don't have that in NeMo yet.
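To make the new guard concrete, below is a minimal, self-contained sketch of the same check as a standalone helper. The flag names (use_custom_fsdp, gradient_accumulation_fusion) come from the diff above; the imports, the helper name check_custom_fsdp_compat, and its use outside init_ddp are illustrative assumptions, not code from this PR.

# Hedged sketch: the guard added in init_ddp, expressed as a standalone check.
# DistributedDataParallelConfig and TransformerConfig are assumed to be the
# Megatron-core dataclasses that carry these flags; import paths may differ
# between Megatron-core versions.
from megatron.core.distributed import DistributedDataParallelConfig
from megatron.core.transformer import TransformerConfig


def check_custom_fsdp_compat(ddp_config: DistributedDataParallelConfig,
                             model_config: TransformerConfig) -> None:
    """Illustrative mirror of the assertion this PR adds to init_ddp."""
    if ddp_config.use_custom_fsdp:
        # The PR requires the fused gradient-accumulation path to be disabled
        # whenever the custom (MCORE) FSDP code path is active.
        assert not model_config.gradient_accumulation_fusion, (
            "gradient_accumulation_fusion must be False when using custom FSDP"
        )

In practice, a recipe that enables use_custom_fsdp must also set gradient_accumulation_fusion=False on its model config; otherwise init_ddp will now fail with the assertion message above.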
Review thread on the default sharding strategy:

Reviewer: Is this valid only with use_custom_fsdp=True?

Author: Yes, I think so. This is the default value in mcore: https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/blob/main/megatron/core/distributed/distributed_data_parallel_config.py?ref_type=heads#L63

Reviewer: Then, wouldn't optim_grads_params be the default option from the NeMo side? So use_custom_fsdp=True means ZeRO-3. Also, this needs to be described in the comment description: Custom (MCORE) FSDP data parallel sharding strategy, choices=['no_shard', 'optim', 'optim_grads', 'optim_grads_params'].

Author: Modified the comment accordingly.
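For reference, a hedged configuration sketch matching the thread above. The field names and the choices list are taken from the review comment and the linked distributed_data_parallel_config.py; the ZeRO-level mapping of the intermediate options is an interpretation, and exact defaults may vary across Megatron-core versions.

# Hedged sketch: pairing use_custom_fsdp with a sharding strategy. Field names
# are assumed to match DistributedDataParallelConfig in the Megatron-core
# revision this PR targets (see the gitlab link above).
from megatron.core.distributed import DistributedDataParallelConfig

# data_parallel_sharding_strategy choices, per the review comment:
#   'no_shard'            -> no data-parallel sharding (plain DDP replication)
#   'optim'               -> shard optimizer state               (~ZeRO-1)
#   'optim_grads'         -> shard optimizer state and gradients (~ZeRO-2)
#   'optim_grads_params'  -> shard optimizer, grads, and params  (~ZeRO-3)
ddp_config = DistributedDataParallelConfig(
    use_custom_fsdp=True,
    # The thread concludes that 'optim_grads_params' (ZeRO-3) should be the
    # NeMo-side default when custom FSDP is enabled.
    data_parallel_sharding_strategy="optim_grads_params",
)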