Add cross data center communications and network topology awareness to NCCL #1659
## Cross Data Center Communication and Network Topology Awareness
**Goal:** Enable NCCL to perform multi-DC communication with minimal modification to the AI training workloads.

This feature supports two use-cases for multi-DC communication. Prior NCCL releases already provided support for using different communication backends, but only in separate communicators.
## Enabling the Use of Multiple Networks
For NCCL to use multiple networks, one has to set `NCCL_ALLNET_ENABLE=1`. Note that this disables the use of `collNet`. We further advise setting `NCCL_ALLNET_FASTNET="IB"` so that NCCL knows which network will be used to detect the DC topology.
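A minimal sketch of the corresponding environment setup (the variables are the ones described above; how you propagate them to the job depends on your launcher):

```bash
# Enable the use of multiple networks (note: this disables collNet).
export NCCL_ALLNET_ENABLE=1
# Tell NCCL which network to use when detecting the DC topology.
export NCCL_ALLNET_FASTNET="IB"
```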
## The Fabric ID

The feature relies on the concept of a `fabricId`, which NCCL uses to capture the topology information and ensure connectivity between the devices. The `fabricId` is provided by the user, and the way to do so depends on the network plugin in use. Below we detail the usage with our internal IB plugin.

In the internal IB plugin, the fabric ID is set through the environment variable `NCCL_IB_HCA`: `NCCL_IB_HCA="=device:port:fabricID"`.
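For example, the following hypothetical setting assigns `fabricId` `0` to port `1` of device `mlx5_0` (device name, port, and fabricId are placeholders for your system):

```bash
# '=' forces an exact match on the device name (see below).
export NCCL_IB_HCA="=mlx5_0:1:0"
```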
We encourage the user to use the `=` prefix to guarantee an exact match with the device name. Further, the value of the `fabricID` should be a positive integer, up to `1<<48`. It will be interpreted as `DC_ID * MAX_RAILS + RAIL_ID`, where `MAX_RAILS` can be set with `NCCL_IB_FABRICID_MAXRAIL`. If unset, each of the `fabricId` values will be interpreted as a `railId` (i.e. `fabricId = railId` and `dcId = 0`). For example (see also the decoding sketch after this list):
- With `NCCL_IB_FABRICID_MAXRAIL` unset, `fabricId=0` and `fabricId=64` represent two devices that are disconnected from each other.
- With `NCCL_IB_FABRICID_MAXRAIL=64`, `fabricId=0` and `fabricId=64` represent two devices that are connected to each other, but in different data centers, respectively `0` and `1`.
- With `NCCL_IB_FABRICID_MAXRAIL=64`, `fabricId=0` and `fabricId=16` will be interpreted as devices belonging to the same data center but with no direct rail connectivity.
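To make the decomposition concrete, here is a small illustrative sketch (not NCCL code) that decodes the three example values above with shell arithmetic:

```bash
# Decode fabricId into dcId and railId, given NCCL_IB_FABRICID_MAXRAIL=64.
MAX_RAILS=64
for FABRIC_ID in 0 16 64; do
  DC_ID=$(( FABRIC_ID / MAX_RAILS ))
  RAIL_ID=$(( FABRIC_ID % MAX_RAILS ))
  echo "fabricId=${FABRIC_ID} -> dcId=${DC_ID} railId=${RAIL_ID}"
done
# Prints: 0 -> dc 0, rail 0; 16 -> dc 0, rail 16; 64 -> dc 1, rail 0.
```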
## Job Script

For `mpirun`-based jobs, we recommend using a bash script to assign different `fabricId`s to each of the MPI processes. For example, a script along the lines of the sketch below divides the MPI processes into different DCs, each of them of size `DC_SIZE`. If unset, `DC_SIZE` defaults to the number of processes on the node. In our example, we assume 8 dual-port NICs (seen as 16 devices by `ibv_devinfo`).
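A minimal sketch of such a wrapper, assuming OpenMPI-style environment variables (`OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_LOCAL_SIZE`) and placeholder device names (`mlx5_0` through `mlx5_15`):

```bash
#!/bin/bash
# Per-rank wrapper sketch: derives a DC id from the global rank and
# encodes it into the fabricId of every local IB device.

# DC_SIZE defaults to the number of processes on the node.
: "${DC_SIZE:=${OMPI_COMM_WORLD_LOCAL_SIZE}}"

DC_ID=$(( OMPI_COMM_WORLD_RANK / DC_SIZE ))

# 8 dual-port NICs are seen as 16 devices by ibv_devinfo, hence 16 rails.
MAX_RAILS=16
export NCCL_IB_FABRICID_MAXRAIL=${MAX_RAILS}

# Build the device list: device i (rail i) gets fabricId DC_ID*MAX_RAILS+i.
LIST=""
for i in $(seq 0 15); do
  LIST="${LIST:+${LIST},}mlx5_${i}:1:$(( DC_ID * MAX_RAILS + i ))"
done
export NCCL_IB_HCA="=${LIST}"   # '=' prefix forces an exact name match

exec "$@"   # run the actual training binary
```

It could then be launched as, e.g., `mpirun -np 32 -x DC_SIZE ./set_fabricid.sh ./train`, where the script and binary names are placeholders.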
The internal tuning model of NCCL has not yet been adapted to cross-DC communication. Therefore, we recommend setting the desired algorithm to either `RING` or `TREE`.
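For instance, the algorithm can be pinned through the standard `NCCL_ALGO` environment variable:

```bash
# Pin NCCL to the ring algorithm (alternatively: Tree).
export NCCL_ALGO=Ring
```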
## Performance Considerations

The connection between DCs is very likely to drive the overall performance. We recommend testing a few values for various parameters in order to find the best-suited parameter set. Here are some of the parameters we expect to drive the performance (an example starting point is sketched after the list):

- `NCCL_IB_QPS_PER_CONNECTION`, to improve performance for higher-latency IB connections.
- `NCCL_NSOCKS_PERTHREAD` and `NCCL_SOCKET_NTHREADS`, to improve performance for higher-latency TCP connections.
- `NCCL_SOCKET_INLINE` and `NCCL_SOCKET_MIN_TASKSIZE`, to control the size of the TCP messages and the size of the inlined data.
- `NCCL_BUFFSIZE`, together with changing the value of `NCCL_STEPS`.
- `NCCL_SCATTER_XDC`: allows scattering the channels onto different NICs for the cross-DC connection. This will lead to channels following a different rank ordering within a single collective. For an IB inter-DC network, we recommend setting the value to `0`; for a TCP inter-DC network, we recommend setting it to `1`.
- `NCCL_MIN_CTAS` (with `NCCL_SCATTER_XDC=1`): for TCP connections, increasing the number of CTAs will increase the number of channels and therefore the number of TCP NICs that NCCL will use. If allowed to (see above), NCCL maps each channel to a different NIC, so the total number of NICs used within a single collective depends on the number of channels used.
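As an illustration only, a possible starting point for tuning a TCP inter-DC link might look as follows; the concrete values here are assumptions to sweep, not recommendations:

```bash
# Starting point for a TCP inter-DC link; sweep these values for your fabric.
export NCCL_NSOCKS_PERTHREAD=4     # more sockets per thread for high latency
export NCCL_SOCKET_NTHREADS=4      # more socket threads
export NCCL_BUFFSIZE=8388608       # 8 MiB buffers; consider NCCL_STEPS too
export NCCL_SCATTER_XDC=1          # scatter channels across NICs (TCP case)
export NCCL_MIN_CTAS=16            # more CTAs -> more channels -> more NICs
```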