.. meta::
  :description: Usage tips for the RCCL library of collective communication primitives
  :keywords: RCCL, ROCm, library, API, peer-to-peer, transport

.. _rccl-usage-tips:


*****************************************
RCCL usage tips
*****************************************

This topic describes some of the more common RCCL extensions, such as NPKit and MSCCL, and provides tips on how to
configure and customize the application.

NPKit
=====

RCCL integrates `NPKit <https://github.com/microsoft/npkit>`_, a profiler framework that
enables the collection of fine-grained trace events in RCCL components, especially in giant collective GPU kernels.
See the `NPKit sample workflow for RCCL <https://github.com/microsoft/NPKit/tree/main/rccl_samples>`_ for
a fully automated usage example. It also provides useful templates for the following manual instructions.

To manually build RCCL with NPKit enabled, pass ``-DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)"`` to the ``cmake`` command.
All NPKit compile-time switches are declared in the RCCL code base as macros with the prefix ``ENABLE_NPKIT_``.
These switches control the information that is collected.
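For example, a build that collects time-sync events might look like the following. The specific ``ENABLE_NPKIT_EVENT_*`` switch names here are illustrative; check the macros declared in your RCCL checkout before using them.

.. code-block:: shell

   # Illustrative NPKit build; verify the switch names against the
   # ENABLE_NPKIT_* macros in your RCCL source tree.
   cmake -DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU" ..
   make -j$(nproc)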

.. note::

   NPKit only supports the collection of non-overlapped events on the GPU.
   The ``-DNPKIT_FLAGS`` settings must follow this rule.

To manually run RCCL with NPKit enabled, set the environment variable ``NPKIT_DUMP_DIR``
to the NPKit event dump directory. NPKit only supports one GPU per process.
To manually analyze the NPKit dump results, use `npkit_trace_generator.py <https://github.com/microsoft/NPKit/blob/main/rccl_samples/npkit_trace_generator.py>`_.
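A typical manual run might look like the following. The paths are placeholders, and the trace generator's option names may differ between NPKit versions, so confirm them with the script's ``--help`` output.

.. code-block:: shell

   # Create a dump directory and point NPKit at it (one GPU per process).
   mkdir -p /tmp/npkit_dump
   export NPKIT_DUMP_DIR=/tmp/npkit_dump

   # Run your RCCL workload here, then post-process the dump.
   # Option names may vary between NPKit versions; check --help.
   python npkit_trace_generator.py --npkit_dump_dir /tmp/npkit_dump --output_dir /tmp/npkit_trace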

MSCCL/MSCCL++
=============

RCCL integrates `MSCCL <https://github.com/microsoft/msccl>`_ and `MSCCL++ <https://github.com/microsoft/mscclpp>`_ to
leverage these highly efficient GPU-to-GPU communication primitives for collective operations.
Microsoft collaborated with AMD on this project.

MSCCL uses XML files to describe different collective algorithms on different architectures.
RCCL collectives can leverage these algorithms after the user provides the corresponding XML file.
The XML files contain sequences of send-recv and reduction operations for the kernel to run.

MSCCL is enabled by default on the AMD Instinct™ MI300X accelerator. On other platforms, users might have to enable it
by setting ``RCCL_MSCCL_FORCE_ENABLE=1``. By default, MSCCL is only used if every rank belongs
to a unique process. To disable this restriction for multi-threaded or single-threaded configurations,
set ``RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1``.
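For example, to force-enable MSCCL on a platform where it is not on by default and to allow it when multiple ranks share a process:

.. code-block:: shell

   # Force-enable MSCCL on platforms other than MI300X.
   export RCCL_MSCCL_FORCE_ENABLE=1
   # Permit MSCCL when multiple ranks run in a single process.
   export RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1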

RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels
for certain message sizes. MSCCL++ support is available whenever MSCCL support is available.
To run an RCCL workload with MSCCL++ support, set the following RCCL environment variable:

.. code-block:: shell

   RCCL_MSCCLPP_ENABLE=1
To set the message size threshold for using MSCCL++, use the environment variable ``RCCL_MSCCLPP_THRESHOLD``,
which has a default value of 1 MB. After ``RCCL_MSCCLPP_THRESHOLD`` has been set,
RCCL invokes MSCCL++ kernels for all message sizes less than or equal to the specified threshold.
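For example, to raise the threshold so that messages up to 2 MiB use the MSCCL++ kernels:

.. code-block:: shell

   export RCCL_MSCCLPP_ENABLE=1
   # Use MSCCL++ kernels for messages of 2 MiB or smaller.
   export RCCL_MSCCLPP_THRESHOLD=$((2 * 1024 * 1024))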

The following restrictions apply when using MSCCL++. If these restrictions are not met,
operations fall back to using MSCCL or RCCL.

* The message size must be a non-zero multiple of 32 bytes
* ``hipMallocManaged`` buffers are not supported
* Allreduce only supports the ``float16``, ``int32``, ``uint32``, ``float32``, and ``bfloat16`` data types
* Allreduce only supports the sum operation
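As a rough illustration, the fallback rules above can be sketched as a hypothetical eligibility check. This helper is not part of the RCCL API; the ``threshold`` default mirrors the 1 MB default of ``RCCL_MSCCLPP_THRESHOLD``.

.. code-block:: python

   # Hypothetical helper illustrating the MSCCL++ fallback rules above;
   # it is not part of the RCCL API.
   SUPPORTED_ALLREDUCE_DTYPES = {"float16", "int32", "uint32", "float32", "bfloat16"}

   def mscclpp_eligible(message_bytes, collective="allreduce", op="sum",
                        dtype="float32", managed_buffer=False,
                        threshold=1_000_000):
       """Return True if the MSCCL++ kernels would handle this call."""
       if managed_buffer:                        # hipMallocManaged is unsupported
           return False
       if message_bytes == 0 or message_bytes % 32 != 0:
           return False                          # non-zero multiple of 32 bytes
       if message_bytes > threshold:             # RCCL_MSCCLPP_THRESHOLD
           return False
       if collective == "allreduce":             # sum only, limited dtypes
           return op == "sum" and dtype in SUPPORTED_ALLREDUCE_DTYPES
       return collective == "allgather"          # only allreduce and allgather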

Enabling peer-to-peer transport
===============================

To enable peer-to-peer access on machines with PCIe-connected GPUs,
set the HSA environment variable as follows:

.. code-block:: shell

   HSA_FORCE_FINE_GRAIN_PCIE=1

This feature requires GPUs that support peer-to-peer access along with
proper large BAR addressing support.

Improving performance on the MI300X accelerator when using fewer than 8 GPUs
============================================================================

On a system with 8\*MI300X accelerators, each pair of accelerators is connected with dedicated XGMI links
in a fully-connected topology. Collective operations can therefore achieve good performance when
all 8 accelerators (and all XGMI links) are used. When fewer than 8 GPUs are used, only a fraction
of the potential bandwidth on the system can be achieved.
If your workload warrants using fewer than 8 MI300X accelerators on a system,
you can set the runtime variable ``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:

.. code-block:: shell

   export NCCL_MIN_NCHANNELS=32

Increasing the number of channels can benefit performance, but it also increases
GPU utilization for collective operations.
Additionally, RCCL predefines a higher number of channels when only 2 or
4 accelerators are in use on an 8\*MI300X system. In this situation, RCCL uses 32 channels for two MI300X accelerators
and 24 channels for four MI300X accelerators.
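For example, when benchmarking with `rccl-tests <https://github.com/ROCm/rccl-tests>`_ on four of the eight accelerators, you might combine the variable with an allreduce sweep. The binary path and flags below follow the rccl-tests README; adjust them for your build.

.. code-block:: shell

   export NCCL_MIN_NCHANNELS=32
   # all_reduce_perf: sweep message sizes from 8 B to 128 MB on 4 GPUs.
   ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4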