
Recipe changes for performance #11763

Merged
merged 27 commits into from
Feb 5, 2025

Conversation

guyueh1
Contributor

@guyueh1 guyueh1 commented Jan 6, 2025

What does this PR do ?

Recipe changes for performance in 25.01 release

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
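A minimal usage sketch (hypothetical and untested; the recipe module, parallelism sizes, and callback keyword names below are assumptions inferred from the files this PR touches, not a confirmed API):

# Hypothetical sketch: attach one of the new TP-overlap configs to a pretrain recipe.
import nemo_run as run
from nemo.collections import llm
from nemo.collections.llm.recipes.tp_overlap_configs.userbuffers import (
    userbuffers_bf16_h100_h18432_tp8_mbs1_seqlen4096,
)
from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback

# Recipe module name and node counts are illustrative assumptions.
recipe = llm.nemotron4_340b.pretrain_recipe(name="nemotron4_340b_perf", num_nodes=64, num_gpus_per_node=8)
recipe.trainer.callbacks.append(
    run.Config(
        MegatronCommOverlapCallback,
        tp_comm_overlap=True,  # assumed kwarg name
        tp_comm_overlap_cfg=userbuffers_bf16_h100_h18432_tp8_mbs1_seqlen4096,  # assumed kwarg name
    )
)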

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

guyueh1 and others added 3 commits January 6, 2025 09:33
Signed-off-by: Guyue Huang <[email protected]>

Conflicts:
	nemo/lightning/run/plugins.py
Signed-off-by: Guyue Huang <[email protected]>
@guyueh1 guyueh1 marked this pull request as ready for review January 8, 2025 19:06
guyueh1 and others added 2 commits January 8, 2025 11:07
Conflicts:
	nemo/lightning/run/plugins.py
erhoo82
erhoo82 previously approved these changes Jan 8, 2025
Collaborator

@erhoo82 erhoo82 left a comment

LGTM

@erhoo82 erhoo82 self-requested a review January 14, 2025 19:22
erhoo82
erhoo82 previously approved these changes Jan 14, 2025
@@ -168,3 +182,17 @@ class TransformerLayerTPOverlapCfg:
    proj_fprop=PipelineOverlapCfg(num_sm=24, cga_size=2, num_splits=4, set_sm_margin=True, fp8_buf=True),
    fc2_fprop=RingExchangeOverlapCfg(num_sm=1, set_sm_margin=True),
)

# Nemotron 340B
userbuffers_bf16_h100_h18432_tp8_mbs1_seqlen4096 = TransformerLayerTPOverlapCfg(
Collaborator

Is this an overlap config for Hopper or Blackwell?

@erhoo82 erhoo82 enabled auto-merge (squash) January 14, 2025 19:23
if tp_size > 1 or cp_size > 1:
    executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
Contributor Author

@erhoo82 This method won't work because it runs on the cluster frontend node, not after the Slurm allocation. We need to find another way.

@erhoo82 erhoo82 added Run CICD and removed Run CICD labels Jan 14, 2025
auto-merge was automatically disabled January 15, 2025 17:58

Head branch was pushed to by a user without write access

    os.environ.pop('CUDA_DEVICE_MAX_CONNECTIONS')
else:
    if tp_size > 1 or cp_size > 1:
        os.environ['CUDA_DEVICE_MAX_CONNECTIONS'] = "1"
Collaborator

It could also be good to add a docstring for this condition:
Set the device connections to 1 to enforce that the kernel queuing order from the host matches the execution order on the GPU. This is needed to schedule a communication kernel before the overlapping persistent GEMM kernel; otherwise, the communication kernel is pushed to the end of the GEMM kernel and fails to overlap with it.

if major > 9:
    if (tp_size > 1 or cp_size > 1) and (dp_size > 1 or pp_size > 1):
        # Default is 8, but for this case, we need extra connections
        # to avoid serialization of streams
Collaborator

Default is 8, but for this case, we need extra connections to avoid serialization of streams
to
We need extra connections to avoid serialization of streams, so we use the max connections of 32 instead of the default device connection of 8.
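
Taken together, the thread above suggests logic roughly along these lines (a sketch only; the function name and its exact placement inside MegatronCommOverlapCallback are assumptions, not the PR's final implementation):

import logging
import os

import torch


def _set_num_cuda_device_max_connections(tp_size: int, cp_size: int, dp_size: int, pp_size: int) -> None:
    """Sketch of the env-var logic discussed in this thread."""
    major, _ = torch.cuda.get_device_capability()
    if major > 9:
        # Newer than Hopper: extra connections avoid serialization of streams,
        # so use the maximum of 32 instead of the default of 8.
        if (tp_size > 1 or cp_size > 1) and (dp_size > 1 or pp_size > 1):
            os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "32"
        else:
            os.environ.pop("CUDA_DEVICE_MAX_CONNECTIONS", None)
    else:
        # Hopper and older: a single connection enforces that the host kernel
        # queuing order matches the GPU execution order, so the communication
        # kernel can be scheduled before the overlapping persistent GEMM kernel.
        if tp_size > 1 or cp_size > 1:
            os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
    logging.info("CUDA_DEVICE_MAX_CONNECTIONS=%s", os.environ.get("CUDA_DEVICE_MAX_CONNECTIONS"))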

Signed-off-by: Guyue Huang <[email protected]>
erhoo82
erhoo82 previously approved these changes Jan 29, 2025
Collaborator

@erhoo82 erhoo82 left a comment

LGTM

@erhoo82 erhoo82 added Run CICD and removed Run CICD labels Jan 29, 2025
Signed-off-by: Guyue Huang <[email protected]>
Contributor

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.


Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.llm.recipes.tp_overlap_configs.userbuffers
nemo/collections/llm/recipes/tp_overlap_configs/userbuffers.py:19:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/llm/recipes/tp_overlap_configs/userbuffers.py:24:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/llm/recipes/tp_overlap_configs/userbuffers.py:34:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/llm/recipes/tp_overlap_configs/userbuffers.py:42:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/llm/recipes/tp_overlap_configs/userbuffers.py:50:0: C0115: Missing class docstring (missing-class-docstring)
************* Module nemo.lightning.pytorch.callbacks.megatron_comm_overlap
nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py:81:0: C0301: Line too long (121/119) (line-too-long)
nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py:291:0: C0301: Line too long (124/119) (line-too-long)
nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py:255:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py:322:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py:326:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py:330:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py:334:4: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.45/10

Mitigation guide:

  • Add sensible and useful docstrings to functions and methods
  • For trivial methods like getter/setters, consider adding # pylint: disable=C0116 inside the function itself
  • To disable multiple functions/methods at once, put a # pylint: disable=C0116 before the first and a # pylint: enable=C0116 after the last.
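
For example, the disable/enable pattern from the last bullet could look like this (hypothetical getter/setter names):

# pylint: disable=C0116
def get_num_sm(self):
    return self._num_sm

def set_num_sm(self, value):
    self._num_sm = value
# pylint: enable=C0116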

By applying these rules, we reduce the occurrence of this message in the future.

Thank you for improving NeMo's documentation!

@guyueh1
Contributor Author

guyueh1 commented Feb 3, 2025

@erhoo82 I pushed some minor changes (added debug logging); it needs approval again.

erhoo82
erhoo82 previously approved these changes Feb 4, 2025
@erhoo82 erhoo82 enabled auto-merge (squash) February 4, 2025 20:12
@erhoo82 erhoo82 added Run CICD and removed Run CICD labels Feb 4, 2025
Signed-off-by: Guyue Huang <[email protected]>
auto-merge was automatically disabled February 4, 2025 21:22

Head branch was pushed to by a user without write access

@erhoo82 erhoo82 added Run CICD and removed Run CICD labels Feb 4, 2025
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 30.47%. Comparing base (09186c3) to head (329559a).
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11763      +/-   ##
==========================================
+ Coverage   30.30%   30.47%   +0.17%     
==========================================
  Files        1387     1395       +8     
  Lines      176283   177041     +758     
  Branches    27091    27147      +56     
==========================================
+ Hits        53423    53956     +533     
- Misses     118776   118975     +199     
- Partials     4084     4110      +26     


@guyueh1
Contributor Author

guyueh1 commented Feb 5, 2025

Tests passed except for one optional test (T5 training; the assertion error seems unrelated to my changes).
@erhoo82 what's the next step?

@ko3n1g ko3n1g merged commit 6db2b32 into NVIDIA:main Feb 5, 2025
228 of 229 checks passed
ko3n1g pushed a commit that referenced this pull request Feb 5, 2025
* [Nemo2] allow setting CUDA_DEVICE_MAX_CONNECTIONS

Signed-off-by: Guyue Huang <[email protected]>

Conflicts:
	nemo/lightning/run/plugins.py

* Add a tp2 ub config

Signed-off-by: Guyue Huang <[email protected]>

* Recipe tuning for mixtral, nemotron4

Signed-off-by: Guyue Huang <[email protected]>

* Revert mixtral config change

Signed-off-by: Guyue Huang <[email protected]>

* Decide cuda device max connections based on torch.cuda.get_device_capability

Signed-off-by: Guyue Huang <[email protected]>

* Rename custom_cuda_device_max_connections to num_cuda_device_max_connections

Signed-off-by: Guyue Huang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: guyueh1 <[email protected]>

* Remove explicit config of align_param_gather in mixtral recipe and use default

* Revert "Remove explicit config of align_param_gather in mixtral recipe and use default"

This reverts commit 2b8114b.

* Rename ub config; change proj to ring exchange for nemotron 340b

Signed-off-by: Guyue Huang <[email protected]>

* Update the logic to set cuda_device_max_connections

Signed-off-by: Guyue Huang <[email protected]>

* Revert changes to  PerfEnvPlugin

Signed-off-by: Guyue Huang <[email protected]>

* Move setup of CUDA_DEVICE_MAX_CONNECTIONS to MegatronCommOverlapCallback

Signed-off-by: Guyue Huang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: guyueh1 <[email protected]>

* Add b200 tp overlap configs for gpt3 and llama3 models

Signed-off-by: Guyue Huang <[email protected]>

* Revert changes to nemotron recipe; will put those changes in performance scripts in a separate PR

Signed-off-by: Guyue Huang <[email protected]>

* Add two docstrings

Signed-off-by: Guyue Huang <[email protected]>

* Fix os.environ.pop

Signed-off-by: Guyue Huang <[email protected]>

* Add logging when setting CUDA_DEVICE_MAX_CONNECTIONS

Signed-off-by: Guyue Huang <[email protected]>

* Fix pylint and flake8

Signed-off-by: Guyue Huang <[email protected]>

---------

Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: guyueh1 <[email protected]>
Co-authored-by: Guyue Huang <[email protected]>
Co-authored-by: guyueh1 <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Guyue Huang <[email protected]>
@pablo-garay
Collaborator

LGTM

BoxiangW pushed a commit that referenced this pull request Feb 7, 2025
youngeunkwon0405 pushed a commit to youngeunkwon0405/NeMo that referenced this pull request Feb 10, 2025