Custom (MCORE) FSDP interface #12391
base: main
Conversation
@@ -57,6 +57,8 @@
bucket_size: Optional[int] = None # Maximum number of parameters in each bucket
average_in_collective: bool = False # If true, compute average in collective directly, as opposed to dividing by the dp_size first and then computing sum in the collective
fp8_param_gather: bool = False # If true, keep the compute param in fp8 (do not use any other intermediate dtype) and perform the param all-gather in fp8
use_custom_fsdp: bool = False # If true, use MCore's custom FSDP implementation. recipe.model.config.gradient_accumulation_fusion must be False when using this
data_parallel_sharding_strategy: str = "no_shard" # Data parallel sharding strategy, choices=['no_shard', 'optim', 'optim_grads', 'optim_grads_params']
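For context, a hedged sketch of how the new fields in the diff above might be set together; the DistributedDataParallelConfig import path and the exact field combination are assumptions rather than a verified configuration from this PR.

```python
# Hedged sketch only: assumes these fields live on MCore's DistributedDataParallelConfig
# (the diff above uses the same field names); the import path may differ by version.
from megatron.core.distributed import DistributedDataParallelConfig

ddp_config = DistributedDataParallelConfig(
    use_custom_fsdp=True,  # opt in to MCore's custom FSDP implementation
    data_parallel_sharding_strategy="optim_grads_params",  # shard optimizer state, grads, and params
    average_in_collective=True,  # average inside the collective instead of pre-dividing by dp_size
)
```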
Is this valid only with use_custom_fsdp=True?
Yes, I think so. This is the default value in mcore https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/blob/main/megatron/core/distributed/distributed_data_parallel_config.py?ref_type=heads#L63
Then, wouldn't optim_grads_params be the default option from the NeMo side? So, use_custom_fsdp=True means ZeRO-3.
Also, this needs to be described in the comment description: Custom (MCORE) FSDP data parallel sharding strategy, choices=['no_shard', 'optim', 'optim_grads', 'optim_grads_params']
Modified the comment accordingly.
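For reference, a hedged mapping of the data_parallel_sharding_strategy choices discussed in this thread to ZeRO stages, inferred from the comments above rather than taken from MCore documentation:

```python
# Inferred from the review thread above; treat as an assumption, not MCore documentation.
SHARDING_STRATEGY_TO_ZERO = {
    "no_shard": "plain DDP (no sharding)",
    "optim": "ZeRO-1 (shard optimizer state)",
    "optim_grads": "ZeRO-2 (shard optimizer state and gradients)",
    "optim_grads_params": "ZeRO-3 (shard optimizer state, gradients, and parameters)",
}
```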
nemo/lightning/megatron_parallel.py
Outdated
@@ -689,6 +689,12 @@ def init_ddp(self):
) # We need to do this explicitly since this is a attr pytorch uses
model_chunk.__class__.__getattr__ = getattr_proxy # type: ignore

# Ensure that if using custom FSDP, gradient_accumulation_fusion is disabled on the model config.
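As a minimal sketch, the guard that comment describes might look like the following inside init_ddp; the name ddp_config and the attribute layout are assumptions based on the surrounding diff, not the exact NeMo implementation.

```python
# Hedged sketch: assumed attribute names, not the exact NeMo code.
if getattr(ddp_config, "use_custom_fsdp", False):
    assert not model_chunk.config.gradient_accumulation_fusion, (
        "gradient_accumulation_fusion cannot be used with MCORE FSDP"
    )
```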
I think this should be changed to this:
gradient_accumulation_fusion cannot be used with MCORE FSDP
Also, gradient_accumulation_fusion shouldn't work with FSDP2 either, because we need to ReduceScatter the current gradient before accumulating the local grad shard partial sum.
I will modify the text to say that. For Torch FSDP2, I agree, but we don't have that in NeMo yet.
Please change the name of the PR to "Custom (MCORE) FSDP interface". Also, the other FSDP args need to be added to the arg map.
Are the args properly ported to MCORE modules?
Okay, I will also add these.
Yes, it was working well.
@erhoo82 I guess the
@youngeunkwon0405 What fix does
Also, can we set the default value of
Not sure why
I was speculating like this because some modification was required in MLM: https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/2071/diffs#67e48551394c01c5c3935d0a3546c2404cfce8a4_521_549 We need to test it first.
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
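A hedged usage sketch, assuming the new flags are exposed on the recipe's DDP config: the llm.llama3_8b.pretrain_recipe entry point and the trainer.strategy.ddp path are assumptions, while recipe.model.config.gradient_accumulation_fusion comes from the config comment in this PR.

```python
# Hedged usage sketch; the recipe entry point and attribute paths are assumptions.
from nemo.collections import llm

recipe = llm.llama3_8b.pretrain_recipe(name="llama3_8b_custom_fsdp", num_nodes=1)

# Enable MCore's custom FSDP with full (ZeRO-3-style) sharding.
recipe.trainer.strategy.ddp.use_custom_fsdp = True
recipe.trainer.strategy.ddp.data_parallel_sharding_strategy = "optim_grads_params"

# Required when using custom FSDP (see the config comment in this PR).
recipe.model.config.gradient_accumulation_fusion = False
```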
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines contain specific people who can review PRs to various areas.
Additional Information