Recipe changes for performance #11763
Conversation
LGTM
@@ -168,3 +182,17 @@ class TransformerLayerTPOverlapCfg:
    proj_fprop=PipelineOverlapCfg(num_sm=24, cga_size=2, num_splits=4, set_sm_margin=True, fp8_buf=True),
    fc2_fprop=RingExchangeOverlapCfg(num_sm=1, set_sm_margin=True),
)

# Nemotron 340B
userbuffers_bf16_h100_h18432_tp8_mbs1_seqlen4096 = TransformerLayerTPOverlapCfg(
Is this an overlap config for Hopper or Blackwell?
nemo/lightning/run/plugins.py
Outdated
if tp_size > 1 or cp_size > 1:
    executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
@erhoo82 This method won't work because it runs on the cluster frontend node, not after the Slurm allocation. We need to find another way.
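A minimal sketch of the alternative the PR eventually takes (the commit list below mentions moving the setup of CUDA_DEVICE_MAX_CONNECTIONS into MegatronCommOverlapCallback), assuming a hypothetical helper name; only the environment variable and the capability query come from the diff above:

import os

import torch


def _configure_cuda_device_max_connections(tp_size: int, cp_size: int) -> None:
    # Hypothetical helper, meant to run on the allocated compute node (for
    # example from a callback's setup hook) rather than the cluster frontend,
    # so the capability query reflects the actual training GPU.
    if not torch.cuda.is_available():
        return
    major, _ = torch.cuda.get_device_capability()
    if major <= 9 and (tp_size > 1 or cp_size > 1):
        # Pre-Blackwell devices: serialize host-side kernel queuing so the
        # communication kernel can overlap the persistent GEMM kernel.
        os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"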
Head branch was pushed to by a user without write access
    os.environ.pop('CUDA_DEVICE_MAX_CONNECTIONS')
else:
    if tp_size > 1 or cp_size > 1:
        os.environ['CUDA_DEVICE_MAX_CONNECTIONS'] = "1"
It could also be good to add a docstring for this condition:
Set the device connections to 1 to enforce the kernel queuing order from the host to the execution order on the GPU. This is needed to schedule a communication kernel before the overlapping persistent GEMM kernel. Otherwise, the communication kernel would be pushed to the end of the GEMM kernel, failing to overlap the kernels.
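For illustration only, the suggested wording could sit as an inline comment next to the condition from the diff above (tp_size and cp_size are assumed to be provided by the surrounding code; the example values are placeholders):

import os

tp_size, cp_size = 2, 1  # placeholder values for the sketch

if tp_size > 1 or cp_size > 1:
    # Set the device connections to 1 to enforce that the kernel queuing order
    # from the host matches the execution order on the GPU. This is needed to
    # schedule a communication kernel before the overlapping persistent GEMM
    # kernel; otherwise the communication kernel would be pushed behind the
    # GEMM kernel and fail to overlap with it.
    os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"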
if major > 9:
    if (tp_size > 1 or cp_size > 1) and (dp_size > 1 or pp_size > 1):
        # Default is 8, but for this case, we need extra connections
        # to avoid serialization of streams
Suggest changing
"Default is 8, but for this case, we need extra connections to avoid serialization of streams"
to
"We need extra connections to avoid serialization of streams, so we use the max connections of 32 instead of the default device connection of 8."
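Applied to the hunk above, the reworded comment might read like this sketch (the parallelism sizes and the compute capability are placeholder values; the numbers 8 and 32 come from the suggestion itself):

import os

major = 10                 # placeholder: a device with compute capability > 9
tp_size, cp_size = 2, 1    # placeholder parallelism sizes
dp_size, pp_size = 4, 1

if major > 9:
    if (tp_size > 1 or cp_size > 1) and (dp_size > 1 or pp_size > 1):
        # We need extra connections to avoid serialization of streams, so we
        # use the max connections of 32 instead of the default device
        # connection of 8.
        os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "32"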
LGTM
beep boop 🤖: 🙏 The following files have warnings. If you are familiar with these, please try helping us improve the code base. Your code was analyzed with PyLint, and the following annotations have been identified:
Mitigation guide:
By applying these rules, we reduce the occurrence of this message in the future. Thank you for improving NeMo's documentation!
@erhoo82 I pushed some minor changes (added debug logging); it needs approval again.
Head branch was pushed to by a user without write access
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #11763      +/-   ##
==========================================
+ Coverage   30.30%   30.47%   +0.17%
==========================================
  Files        1387     1395       +8
  Lines      176283   177041     +758
  Branches    27091    27147      +56
==========================================
+ Hits        53423    53956     +533
- Misses     118776   118975     +199
- Partials     4084     4110      +26

☔ View full report in Codecov by Sentry.
Tests passed except for one optional test (T5 training); the assertion error seems unrelated to my changes.
* [Nemo2] allow setting CUDA_DEVICE_MAX_CONNECTIONS (Conflicts: nemo/lightning/run/plugins.py)
* Add a tp2 ub config
* Recipe tuning for mixtral, nemotron4
* Revert mixtral config change
* Decide cuda device max connections based on torch.cuda.get_device_capability
* Rename custom_cuda_device_max_connections to num_cuda_device_max_connections
* Apply isort and black reformatting
* Remove explicit config of align_param_gather in mixtral recipe and use default
* Revert "Remove explicit config of align_param_gather in mixtral recipe and use default" (reverts commit 2b8114b)
* Rename ub config; change proj to ring exchange for nemotron 340b
* Update the logic to set cuda_device_max_connections
* Revert changes to PerfEnvPlugin
* Move setup of CUDA_DEVICE_MAX_CONNECTIONS to MegatronCommOverlapCallback
* Apply isort and black reformatting
* Add b200 tp overlap configs for gpt3 and llama3 models
* Revert changes to nemotron recipe; will put those changes in performance scripts in a separate PR
* Add two docstrings
* Fix os.environ.pop
* Add logging when setting CUDA_DEVICE_MAX_CONNECTIONS
* Fix pylint and flake8

Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: guyueh1 <[email protected]>
Co-authored-by: Guyue Huang <[email protected]>
Co-authored-by: guyueh1 <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
LGTM
What does this PR do?
Recipe changes for performance in the 25.01 release.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
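The template snippet was left unfilled; below is a hedged sketch of how the pieces this PR touches could be wired together in a NeMo 2.0 recipe. The import paths and callback arguments are assumptions inferred from the files changed here (the tp_overlap_configs and MegatronCommOverlapCallback), not a verified public API.

# Sketch only: import paths and argument names are assumptions based on the
# files touched by this PR, not a confirmed API.
from nemo.collections.llm.recipes.tp_overlap_configs.userbuffers import (
    userbuffers_bf16_h100_h18432_tp8_mbs1_seqlen4096,
)
from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import (
    MegatronCommOverlapCallback,
)

# Enable tensor-parallel communication overlap with the new Nemotron 340B
# userbuffers config; CUDA_DEVICE_MAX_CONNECTIONS is then set at runtime by
# the callback, based on device capability and parallelism sizes.
comm_overlap = MegatronCommOverlapCallback(
    tp_comm_overlap=True,
    tp_comm_overlap_cfg=userbuffers_bf16_h100_h18432_tp8_mbs1_seqlen4096,
)

# Attach the callback to a pretraining recipe's trainer, for example:
#   recipe = llm.nemotron4_340b.pretrain_recipe(...)
#   recipe.trainer.callbacks.append(comm_overlap)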
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information