Add cross data center communications and network topology awareness to NCCL #1659

Open · wants to merge 1 commit into master

Conversation

@thomasgillis (Collaborator) commented Mar 24, 2025

Cross Data Center Communication and Network Topology Awareness

Goal: Enable NCCL to perform multi-DC communication with minimal modification to AI training workloads.

This feature supports two use-cases for multi-DC communication:

  • Within the same communicator, different data centers are connected through different networks (typically IB or RoCE for the intra-DC network and TCP for the inter-DC network).
  • Within the same communicator, different data centers are connected through the same network (typically IB or RoCE for both the intra-DC and the inter-DC networks).

Prior NCCL releases already provide support for using different communication backends in separate communicators.

Enable the Usage of Multiple Networks

For NCCL to use multiple networks, one has to set NCCL_ALLNET_ENABLE=1. Note that this disables the usage of collNet.
Further, we advise setting NCCL_ALLNET_FASTNET="IB" so that NCCL knows which network to use when detecting the DC topology.
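For example, when launching through mpirun, these variables can simply be exported beforehand (a minimal sketch using only the variables described above):

# enable the use of multiple networks within one communicator (this disables collNet)
export NCCL_ALLNET_ENABLE=1
# tell NCCL which network is the fast intra-DC network used to detect the DC topology
export NCCL_ALLNET_FASTNET="IB"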

The Fabric ID

The feature relies on the concept of a fabricId, which NCCL uses to capture topology information and to ensure connectivity between devices.
The fabricId is provided by the user; how to provide it depends on the network plugin in use.
Below we detail the usage of our internal IB plugin.

In the internal IB plugin, the fabricId is set through the environment variable NCCL_IB_HCA: NCCL_IB_HCA="=device:port:fabricID".
We encourage users to keep the = prefix to guarantee an exact match with the device name.
Further, the value of the fabricId should be a non-negative integer, up to (1<<48). It is interpreted as DC_ID * MAX_RAILS + RAIL_ID, where MAX_RAILS can be set with NCCL_IB_FABRICID_MAXRAIL. If NCCL_IB_FABRICID_MAXRAIL is unset, each fabricId value is interpreted as a railId (i.e. fabricId = railId and dcId = 0).
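As an illustration of that decomposition, here is a standalone sketch (not NCCL code) that decodes a fabricId into its dcId and railId:

# decode a fabricId, assuming NCCL_IB_FABRICID_MAXRAIL is set (64 here, as an illustration)
max_rails=${NCCL_IB_FABRICID_MAXRAIL:-64}
fabric_id=65
dc_id=$(( fabric_id / max_rails ))    # -> 1
rail_id=$(( fabric_id % max_rails ))  # -> 1
echo "fabricId ${fabric_id} -> dcId ${dc_id}, railId ${rail_id}"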

For example:

  • By default, fabricId=0 and fabricId=64 will represent two devices that are disconnected from each other.
  • Setting NCCL_IB_FABRICID_MAXRAIL=64 and again using fabricId=0 and fabricId=64 will represent two devices that are connected to each other but located in different data centers, 0 and 1 respectively.
  • Still with NCCL_IB_FABRICID_MAXRAIL=64, fabricId=0 and fabricId=16 will be interpreted as devices belonging to the same data center but with no direct rail connectivity.
  • Two devices from different networks are always disconnected.
  • IB and RoCE devices receive different fabricIds by default, which guarantees that they are not connected to each other. This can be overridden by the user if needed.
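Concretely, the two-DC case from the second bullet above could be expressed as follows (the device name and port are illustrative, and only one HCA per rank is shown for brevity):

export NCCL_IB_FABRICID_MAXRAIL=64
# on a rank in data center 0, rail 0 (the "=" prefix requests an exact device-name match):
export NCCL_IB_HCA="=mlx5_0:1:0"
# on a rank in data center 1, the same rail would be tagged as:
# export NCCL_IB_HCA="=mlx5_0:1:64"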

Job Script

For mpirun-based jobs, we recommend using a bash wrapper script to assign a different fabricId to each MPI process.

For example, the following script divides the MPI processes into DCs, each of size DC_SIZE. If DC_SIZE is unset, it defaults to the number of processes on the node. In this example, we assume 8 dual-port NICs (seen as 16 devices by ibv_devinfo).

#!/bin/bash
if [ -z "$OMPI_COMM_WORLD_SIZE" ] || [ -z "$OMPI_COMM_WORLD_RANK" ]; then
  # we don't have OMPI, try MPICH
  if [ -z "$PMI_SIZE" ] || [ -z "$PMI_RANK" ]; then
    # we don't have MPICH
    mpiName="NONE"
    worldSize=1
    worldRank=0
    localSize=1
  else
    mpiName="MPICH"
    worldSize=${PMI_SIZE:=1}
    worldRank=${PMI_RANK:=0}
    localSize=${MPI_LOCALNRANKS:=1}
  fi
else
  mpiName="OMPI"
  worldSize=${OMPI_COMM_WORLD_SIZE:=1}
  worldRank=${OMPI_COMM_WORLD_RANK:=0}
  localSize=${OMPI_COMM_WORLD_LOCAL_SIZE:=1}
fi

# decide on how many MPI ranks are part of the DC
dcSize=${DC_SIZE:=$localSize}

# compute the fabricId offset of this process's DC
nrails=8
uid=$((nrails * (worldRank / dcSize)))
# build the HCA list: each dual-port NIC (two devices) shares one fabricId (one rail)
hca_list="="\
"mlx5_1::$(( uid+0 )),mlx5_2::$(( uid+0 )),"\
"mlx5_3::$(( uid+1 )),mlx5_4::$(( uid+1 )),"\
"mlx5_5::$(( uid+2 )),mlx5_6::$(( uid+2 )),"\
"mlx5_7::$(( uid+3 )),mlx5_8::$(( uid+3 )),"\
"mlx5_9::$(( uid+4 )),mlx5_10::$(( uid+4 )),"\
"mlx5_11::$(( uid+5 )),mlx5_12::$(( uid+5 )),"\
"mlx5_13::$(( uid+6 )),mlx5_14::$(( uid+6 )),"\
"mlx5_15::$(( uid+7 )),mlx5_16::$(( uid+7 ))"

echo "$(hostname) world rank ${worldRank}, world size ${worldSize}, dcSize ${dcSize}, uid ${uid}, using NCCL_IB_HCA=\"${hca_list}\""
# if DCs are connected through TCP:
NCCL_IB_HCA="${hca_list}" "$@"
# if DCs are connected through IB:
# NCCL_IB_FABRICID_MAXRAIL=${nrails} NCCL_IB_HCA="${hca_list}" "$@"
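The script can then wrap the application on the mpirun command line, for example (the script name and the nccl-tests invocation are illustrative):

mpirun -np 128 ./set_fabricid.sh ./all_reduce_perf -b 8 -e 1G -f 2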

The internal tuning model of NCCL has not yet been adapted to cross-DC communication.
Therefore, we recommend explicitly setting the desired algorithm to either RING or TREE.
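For example, the algorithm can be pinned with the standard NCCL_ALGO variable:

export NCCL_ALGO=Ring
# or: export NCCL_ALGO=Tree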

Performance Considerations

The connection between DCs is very likely to drive the overall performance. We recommend testing several values of the parameters below to find the best-suited set; a combined starting point is sketched after the list. The parameters we expect to drive performance the most are:

  • NCCL_IB_QPS_PER_CONNECTION to improve performance for higher latency IB connections.
  • NCCL_NSOCKS_PERTHREAD and NCCL_SOCKET_NTHREADS to improve performance for higher latency TCP connections.
  • (new in this branch) NCCL_SOCKET_INLINE and NCCL_SOCKET_MIN_TASKSIZE to control the size of the TCP messages and the size of the inlined data.
  • NCCL_BUFFSIZE, together with changing the value of NCCL_STEPS.
  • (new in this branch) NCCL_SCATTER_XDC: allows scattering the channels onto different NICs for the cross-DC connection. This leads to channels following a different rank ordering within a single collective. For an IB inter-DC network, we recommend setting the value to 0; for a TCP inter-DC network, we recommend setting it to 1.
  • NCCL_MIN_CTAS (with NCCL_SCATTER_XDC=1): for TCP connections, increasing the number of CTAs increases the number of channels and therefore the number of TCP NICs that NCCL will use. If allowed to (see above), NCCL maps each channel to a different node, so the total number of NICs used within a single collective depends on the number of channels used.
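As a combined starting point for a TCP inter-DC setup (all values are illustrative and meant to be swept, not taken as recommendations):

export NCCL_NSOCKS_PERTHREAD=4    # more sockets per helper thread for high-latency TCP
export NCCL_SOCKET_NTHREADS=4     # more socket helper threads
export NCCL_SCATTER_XDC=1         # scatter channels across NICs for the TCP cross-DC hop
export NCCL_MIN_CTAS=16           # more CTAs -> more channels -> more TCP NICs in use
export NCCL_BUFFSIZE=8388608      # tune together with NCCL_STEPS
export NCCL_STEPS=16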

Add cross data center communications and network topology awareness to NCCL

Major changes:
- enable the usage of multiple networks within the same communicator with NCCL_ALLNET_ENABLE=1
- introduce the concept of the fabricID for each device
- add ncclNet v11 API with getNetPath to query for topology information
- leverage topology information to adapt RING and TREE algorithms in a hierarchical way

Signed-off-by: Thomas Gillis <[email protected]>