
Commit 4ab67f5

Cherry-pick recent docs fixes to release/rocm-rel-6.3 (#1453)
* Refactor landing page and move some info to What is RCCL (#1415)
  (cherry picked from commit 2d07f18)
* Refactor RCCL install guide into several pages (#1427)
  * Refactor RCCL install guide into several pages
  * Changes from code review and new docker guide
  * Add missing entries to ToC
  * Minor fixes
  * Fix help strings
  * Edits after review and remove extra white space
  (cherry picked from commit bf7c130)
* Update rccl changelog for 6.3.1 (#1433)
  * Update rccl changelog for 6.3.1
  * Fix version number
  * Correct RCCL release version
  * Added details to 6.3.0 changelog
  ---------
  Co-authored-by: corey-derochie-amd <[email protected]>
  (cherry picked from commit e42f10a)
* Modify cmake instruction in build from source (#1445)
  (cherry picked from commit 28594b2)
* Add RCCL debugging guide (#1420)
  * Add RCCL debugging guide
  * Changes from external review
  * More edits from internal review
  * Additional edits
  * Minor correction
  * More changes after external review
  * Integrate index and ToC changes with incoming merge changes
  * Integrate feedback from management review
  * Minor edits from the internal review
  (cherry picked from commit 6d34fb7)
1 parent eef7b29 commit 4ab67f5

11 files changed (+623, -176 lines)

CHANGELOG.md

Lines changed: 14 additions & 2 deletions

```diff
@@ -2,16 +2,28 @@

 Full documentation for RCCL is available at [https://rccl.readthedocs.io](https://rccl.readthedocs.io)

+## RCCL 2.21.5 for ROCm 6.3.1
+
+### Added
+
+### Changed
+
+* Enhanced user documentation
+
+### Resolved issues
+
+* Corrected user help strings in `install.sh`
+
 ## RCCL 2.21.5 for ROCm 6.3.0

 ### Added

-* MSCCL++ integration for specific contexts
+* MSCCL++ integration for AllReduce and AllGather on gfx942
 * Performance collection to rccl_replayer
 * Tuner Plugin example for MI300
 * Tuning table for large number of nodes
 * Support for amdclang++
-* New Rome model
+* Allow NIC ID remapping using `NCCL_RINGS_REMAP` environment variable

 ### Changed
```
README.md

Lines changed: 1 addition & 42 deletions

````diff
@@ -81,7 +81,7 @@ $ git submodule update --init --recursive --depth=1
 ```
 You may substitute an installation path of your own choosing by passing `CMAKE_INSTALL_PREFIX`. For example:
 ```shell
-$ cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install ..
+$ cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install -DCMAKE_BUILD_TYPE=Release ..
 ```
 Note: ensure rocm-cmake is installed, `apt install rocm-cmake`.
````
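For context, an end-to-end out-of-tree build using the updated flag might look like the following sketch; the clone step, install prefix, and job count are illustrative assumptions, not part of this commit.

```shell
# Sketch only: configure and build RCCL with an explicit install prefix and
# a Release build, per the cmake line changed above. Paths are placeholders.
git clone --recursive https://github.com/ROCm/rccl.git
cd rccl && mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install -DCMAKE_BUILD_TYPE=Release ..
make -j"$(nproc)" && make install
```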
````diff
@@ -127,11 +127,6 @@ $ mpirun --allow-run-as-root -np 8 --mca pml ucx --mca btl ^openib -x NCCL_DEBUG

 For more information on rccl-tests options, refer to the [Usage](https://github.com/ROCm/rccl-tests#usage) section of rccl-tests.

-
-## Enabling peer-to-peer transport
-
-In order to enable peer-to-peer access on machines with PCIe-connected GPUs, the HSA environment variable `HSA_FORCE_FINE_GRAIN_PCIE=1` is required to be set, on top of requiring GPUs that support peer-to-peer access and proper large BAR addressing support.
-
 ## Tests

 There are rccl unit tests implemented with the Googletest framework in RCCL. The rccl unit tests require Googletest 1.10 or higher to build and execute properly (installed with the -d option to install.sh).
@@ -152,31 +147,6 @@ will run only AllReduce correctness tests with float16 datatype. A list of avail
 There are also other performance and error-checking tests for RCCL. These are maintained separately at https://github.com/ROCm/rccl-tests.
 See the rccl-tests README for more information on how to build and run those tests.

-## NPKit
-
-RCCL integrates [NPKit](https://github.com/microsoft/npkit), a profiler framework that enables collecting fine-grained trace events in RCCL components, especially in giant collective GPU kernels.
-
-Please check [NPKit sample workflow for RCCL](https://github.com/microsoft/NPKit/tree/main/rccl_samples) as a fully automated usage example. It also provides good templates for the following manual instructions.
-
-To manually build RCCL with NPKit enabled, pass `-DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)"` with cmake command. All NPKit compile-time switches are declared in the RCCL code base as macros with prefix `ENABLE_NPKIT_`, and they control which information will be collected. Also note that currently NPKit only supports collecting non-overlapped events on GPU, and `-DNPKIT_FLAGS` should follow this rule.
-
-To manually run RCCL with NPKit enabled, environment variable `NPKIT_DUMP_DIR` needs to be set as the NPKit event dump directory. Also note that currently NPKit only supports 1 GPU per process.
-
-To manually analyze NPKit dump results, please leverage [npkit_trace_generator.py](https://github.com/microsoft/NPKit/blob/main/rccl_samples/npkit_trace_generator.py).
-
-## MSCCL/MSCCL++
-RCCL integrates [MSCCL](https://github.com/Azure/msccl) and [MSCCL++](https://github.com/microsoft/mscclpp) to leverage the highly efficient GPU-GPU communication primitives for collective operations. Thanks to Microsoft Corporation for collaborating with us in this project.
-
-MSCCL uses XMLs for different collective algorithms on different architectures. RCCL collectives can leverage those algorithms once the corresponding XML has been provided by the user. The XML files contain the sequence of send-recv and reduction operations to be executed by the kernel. On MI300X, MSCCL is enabled by default. On other platforms, the users may have to enable this by setting `RCCL_MSCCL_FORCE_ENABLE=1`. By default, MSCCL will only be used if every rank belongs to a unique process; to disable this restriction for multi-threaded or single-threaded configurations, set `RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1`.
-
-On the other hand, RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels for certain message sizes. MSCCL++ support is available whenever MSCCL support is available. Users need to set the RCCL environment variable `RCCL_MSCCLPP_ENABLE=1` to run RCCL workload with MSCCL++ support. It is also possible to set the message size threshold for using MSCCL++ by using the environment variable `RCCL_MSCCLPP_THRESHOLD`. Once `RCCL_MSCCLPP_THRESHOLD` (the default value is 1MB) is set, RCCL will invoke MSCCL++ kernels for all message sizes less than or equal to the specified threshold.
-
-If some restrictions are not met, it will fall back to MSCCL or RCCL. The following are restrictions on using MSCCL++:
-- Message size must be a non-zero multiple of 32 bytes
-- Does not support `hipMallocManaged` buffers
-- Allreduce only supports `float16`, `int32`, `uint32`, `float32`, and `bfloat16` data types
-- Allreduce only supports the `sum` op
-
 ## Library and API Documentation

 Please refer to the [RCCL Documentation Site](https://rocm.docs.amd.com/projects/rccl/en/latest/) for current documentation.
@@ -191,17 +161,6 @@ pip3 install -r sphinx/requirements.txt
 python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
 ```

-### Improving performance on MI300 when using less than 8 GPUs
-
-On a system with 8\*MI300X GPUs, each pair of GPUs are connected with dedicated XGMI links in a fully-connected topology. So, for collective operations, one can achieve good performance when all 8 GPUs (and all XGMI links) are used. When using less than 8 GPUs, one can only achieve a fraction of the potential bandwidth on the system.
-
-But, if your workload warrants using less than 8 MI300 GPUs on a system, you can set the run-time variable `NCCL_MIN_NCHANNELS` to increase the number of channels.\
-E.g.: `export NCCL_MIN_NCHANNELS=32`
-
-Increasing the number of channels can be beneficial to performance, but it also increases GPU utilization for collective operations.
-
-Additionally, we have pre-defined higher number of channels when using only 2 GPUs or 4 GPUs on a 8\*MI300 system. Here, RCCL will use **32 channels** for the 2 MI300 GPUs scenario and **24 channels** for the 4 MI300 GPUs scenario.
-
 ## Copyright

 All source code and accompanying documentation is copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
````

docs/how-to/rccl-usage-tips.rst

Lines changed: 103 additions & 0 deletions

.. meta::
   :description: Usage tips for the RCCL library of collective communication primitives
   :keywords: RCCL, ROCm, library, API, peer-to-peer, transport

.. _rccl-usage-tips:

*****************************************
RCCL usage tips
*****************************************

This topic describes some of the more common RCCL extensions, such as NPKit and MSCCL, and provides tips on how to
configure and customize the application.

NPKit
=====

RCCL integrates `NPKit <https://github.com/microsoft/npkit>`_, a profiler framework that
enables the collection of fine-grained trace events in RCCL components, especially in giant collective GPU kernels.
See the `NPKit sample workflow for RCCL <https://github.com/microsoft/NPKit/tree/main/rccl_samples>`_ for
a fully-automated usage example. It also provides useful templates for the following manual instructions.

To manually build RCCL with NPKit enabled, pass ``-DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)"`` to the ``cmake`` command.
All NPKit compile-time switches are declared in the RCCL code base as macros with the prefix ``ENABLE_NPKIT_``.
These switches control the information that is collected.

.. note::

   NPKit only supports the collection of non-overlapped events on the GPU.
   The ``-DNPKIT_FLAGS`` settings must follow this rule.

To manually run RCCL with NPKit enabled, set the environment variable ``NPKIT_DUMP_DIR``
to the NPKit event dump directory. NPKit only supports one GPU per process.
To manually analyze the NPKit dump results, use `npkit_trace_generator.py <https://github.com/microsoft/NPKit/blob/main/rccl_samples/npkit_trace_generator.py>`_.
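As an illustrative sketch (the specific event switch and dump path shown here are assumptions,
not prescribed values), an NPKit-enabled build and run might look like this:

.. code-block:: shell

   # Sketch: compile NPKit event collection into RCCL.
   # ENABLE_NPKIT_EVENT_TIME_SYNC_CPU is one example switch; the full set of
   # ENABLE_NPKIT_ macros is declared in the RCCL code base.
   cmake -DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU" ..
   make -j"$(nproc)"

   # Choose a writable dump directory before launching the workload
   # (NPKit supports only one GPU per process).
   export NPKIT_DUMP_DIR=/tmp/npkit_dump
   mkdir -p "$NPKIT_DUMP_DIR"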
MSCCL/MSCCL++
=============

RCCL integrates `MSCCL <https://github.com/microsoft/msccl>`_ and `MSCCL++ <https://github.com/microsoft/mscclpp>`_ to
leverage these highly efficient GPU-GPU communication primitives for collective operations.
Microsoft Corporation collaborated with AMD on this project.

MSCCL uses XMLs for different collective algorithms on different architectures.
RCCL collectives can leverage these algorithms after the user provides the corresponding XML.
The XML files contain sequences of send-recv and reduction operations for the kernel to run.

MSCCL is enabled by default on the AMD Instinct™ MI300X accelerator. On other platforms, users might have to enable it
using the setting ``RCCL_MSCCL_FORCE_ENABLE=1``. By default, MSCCL is only used if every rank belongs
to a unique process. To disable this restriction for multi-threaded or single-threaded configurations,
use the setting ``RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1``.

RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels
for certain message sizes. MSCCL++ support is available whenever MSCCL support is available.
To run an RCCL workload with MSCCL++ support, set the following RCCL environment variable:

.. code-block:: shell

   RCCL_MSCCLPP_ENABLE=1

To set the message size threshold for using MSCCL++, use the environment variable ``RCCL_MSCCLPP_THRESHOLD``,
which has a default value of 1MB. After ``RCCL_MSCCLPP_THRESHOLD`` has been set,
RCCL invokes MSCCL++ kernels for all message sizes less than or equal to the specified threshold.

The following restrictions apply when using MSCCL++. If these restrictions are not met,
operations fall back to using MSCCL or RCCL. A combined usage sketch follows this list.

* The message size must be a non-zero multiple of 32 bytes
* It does not support ``hipMallocManaged`` buffers
* Allreduce only supports the ``float16``, ``int32``, ``uint32``, ``float32``, and ``bfloat16`` data types
* Allreduce only supports the sum operation
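As a usage sketch (the ``mpirun`` launcher and the rccl-tests ``all_reduce_perf`` binary are
assumptions for illustration, not requirements):

.. code-block:: shell

   # Sketch: enable MSCCL++ kernels and raise the message-size threshold
   # to 2 MB; larger messages use the regular MSCCL or RCCL path.
   export RCCL_MSCCLPP_ENABLE=1
   export RCCL_MSCCLPP_THRESHOLD=$((2 * 1024 * 1024))
   mpirun -np 8 ./all_reduce_perf -b 32 -e 2M -f 2 -g 1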
Enabling peer-to-peer transport
===============================

To enable peer-to-peer access on machines with PCIe-connected GPUs,
set the HSA environment variable as follows:

.. code-block:: shell

   HSA_FORCE_FINE_GRAIN_PCIE=1

This feature requires GPUs that support peer-to-peer access along with
proper large BAR addressing support.
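For example, the variable can be exported in the shell that launches the workload
(the test binary shown is a hypothetical rccl-tests invocation):

.. code-block:: shell

   # Sketch: request fine-grained PCIe allocations so peer-to-peer
   # transport can be used between PCIe-connected GPUs.
   export HSA_FORCE_FINE_GRAIN_PCIE=1
   ./all_reduce_perf -b 8 -e 64M -f 2 -g 4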
Improving performance on the MI300X accelerator when using fewer than 8 GPUs
=============================================================================

On a system with 8\*MI300X accelerators, each pair of accelerators is connected with dedicated XGMI links
in a fully-connected topology. For collective operations, this can achieve good performance when
all 8 accelerators (and all XGMI links) are used. When fewer than 8 GPUs are used, however, this can only achieve a fraction
of the potential bandwidth on the system.
If your workload warrants using fewer than 8 MI300X accelerators on a system,
you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:

.. code-block:: shell

   export NCCL_MIN_NCHANNELS=32

Increasing the number of channels can benefit performance, but it also increases
GPU utilization for collective operations.
Additionally, RCCL pre-defines a higher number of channels when only 2 or
4 accelerators are in use on an 8\*MI300X system. In this situation, RCCL uses 32 channels with two MI300X accelerators
and 24 channels with four MI300X accelerators.
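As an illustrative sketch, restricting a run to four accelerators while overriding the channel
count (the device indices and test binary are placeholder assumptions):

.. code-block:: shell

   # Sketch: expose 4 of the 8 MI300X accelerators, then raise the channel
   # count above the 24-channel default RCCL applies in the 4-GPU case.
   export HIP_VISIBLE_DEVICES=0,1,2,3
   export NCCL_MIN_NCHANNELS=32
   ./all_reduce_perf -b 8 -e 1G -f 2 -g 4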
