Scenario: sharing a CU_MEM_HANDLE_TYPE_FABRIC between two isolated GPUs via IMEX
Two containerized processes. Each process has cgroup access to just one individual GPU (two physically different GPUs are used overall).
The two processes have a shared IMEX channel (both containers have cgroup access to the same IMEX channel).
One process uses the VMM API call cuMemCreate() and then exports a handle of type CU_MEM_HANDLE_TYPE_FABRIC via cuMemExportToShareableHandle(). The handle data is transferred to the other process via network communication. The other process tries to import it via cuMemImportFromShareableHandle(). This sequence is at the heart of MNNVL.
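Per the CUDA driver API docs, the exported fabric handle is an opaque fixed-size blob (CUmemFabricHandle is a 64-byte struct), so the "transferred via network communication" step is a plain byte copy. A minimal sketch of just that transfer step, assuming a connected socket pair; no CUDA calls are made and the handle bytes are a stand-in:

```python
import socket

FABRIC_HANDLE_SIZE = 64  # sizeof(CUmemFabricHandle): unsigned char data[64]


def send_handle(sock: socket.socket, handle_bytes: bytes) -> None:
    """Send the opaque handle blob; the importer needs all 64 bytes verbatim."""
    assert len(handle_bytes) == FABRIC_HANDLE_SIZE
    sock.sendall(handle_bytes)


def recv_handle(sock: socket.socket) -> bytes:
    """Receive exactly 64 bytes; retry on short reads."""
    buf = b""
    while len(buf) < FABRIC_HANDLE_SIZE:
        chunk = sock.recv(FABRIC_HANDLE_SIZE - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before full handle arrived")
        buf += chunk
    return buf


if __name__ == "__main__":
    # Stand-in for the blob returned by cuMemExportToShareableHandle().
    exported = bytes(range(64))
    a, b = socket.socketpair()
    send_handle(a, exported)
    imported = recv_handle(b)
    assert imported == exported  # importer must see the identical 64 bytes
```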
1-node setup breaks, 2-node setup works
I worked on a repro that orchestrates this scenario with the DRA Driver 25.3.0-rc.1 for two different cases on a GH200 system:
both containers run on the same node
the two containers run on two different nodes
In both cases, the IMEX daemon setup is done by the DRA driver (with the ComputeDomain primitive). Application code is the same. The difference between the two cases is, as strictly as possible, (only) the node count. Reproduction code, logs, and more environment details are at https://github.com/jgehrcke/jpsnips-nv/tree/main/repros/imex-1node-fabric-hdl-import101.

The 1-node setup breaks, the 2-node setup works. Specifically:
Handle import works when these two processes run on two different nodes (shared IMEX channel, 2-node healthy IMEX daemon setup managed by the DRA driver). Repro 2-nodes-works.sh demonstrates this.
Handle import fails with error 101 ("invalid device") when the two processes run on the same node (shared IMEX channel, 1-node healthy IMEX daemon setup, also managed by the DRA driver). Repro 1-node-breaks.sh demonstrates this.

I first saw this in the context of NCCL (NVIDIA/nccl#1647) and then reduced it to a repro without NCCL, using just the raw CUDA API; see https://github.com/jgehrcke/jpsnips-nv/blob/3e6176f4e46cdabeedf76054a8bb09a1aa7179ba/repros/imex-1node-fabric-hdl-import101/fabric-handle-transfer-test.py#L187.
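For reference, error 101 is CUDA_ERROR_INVALID_DEVICE in the driver API's CUresult enum (cuda.h). A tiny lookup helper I find handy when reading logs; the table is deliberately a small subset of the enum, not the full list:

```python
# Subset of CUresult error codes from cuda.h (CUDA driver API).
# These numeric values are part of the stable driver API.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    101: "CUDA_ERROR_INVALID_DEVICE",
    304: "CUDA_ERROR_OPERATING_SYSTEM",
}


def cu_result_name(code: int) -> str:
    """Map a numeric CUresult to its symbolic name (subset only)."""
    return CU_RESULT_NAMES.get(code, f"unknown CUresult ({code})")


if __name__ == "__main__":
    print(cu_result_name(101))  # CUDA_ERROR_INVALID_DEVICE
```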
"Single-node MNNVL" should work (right?)
Funny term, and an interesting philosophical question: should "Single-node MNNVL" work? It certainly should in the long run. But is it expected to work today? Seemingly yes-ish.
Specifically, it seems to be expected that a handle of type CU_MEM_HANDLE_TYPE_FABRIC exported on one GPU of a node can be imported on a different GPU of the same node, even (and especially) with container isolation. Quotes from out-of-band communication:
For devices that we can't find in either exposed devices (or invisible devices) we assume that it's a multi node import so its likely that the device is hidden from us and we assume the allocation is a multi node import.
and
multinode path should work once you detect and fallback
Problem in NVIDIA driver or in container/k8s stack?
The error is thrown in cuMemImportFromShareableHandle(). The overall methodology (API usage, IMEX setup) seems to be correct. That suggests a problem in the CUDA runtime/driver.
However, potentially the problem is in 'our' stack (GPU Operator, DRA driver, container runtime, ...).
Goal: narrow down where the problem is and enable the use case of single-node IMEX-based handle transfer.
A note on the single-node setup
Orchestrating the single-node setup with the DRA driver 25.3.0 requires manually setting up a ResourceClaim (instead of using a resource claim template associated with a compute domain). That is an interesting limitation that I will describe (for the record) elsewhere.
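For the record, a sketch of what such a manually created ResourceClaim can look like. The apiVersion, request name, and device class name below are assumptions from my notes, not authoritative; verify them against the DRA driver release you run:

```yaml
# Hypothetical manual ResourceClaim requesting an IMEX channel
# (names/versions are assumptions; check your DRA driver release).
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: shared-imex-channel
spec:
  devices:
    requests:
    - name: channel
      # Device class assumed to be served by the NVIDIA DRA driver's
      # ComputeDomain support.
      deviceClassName: compute-domain-default-channel.nvidia.com
```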
Related resources
https://docs.nvidia.com/cuda/cuda-driver-api/structCUmemFabricHandle__v1.html#structCUmemFabricHandle__v1

"An opaque handle representing a memory allocation that can be exported to processes in same or different nodes. For IPC between processes on different nodes they must be connected via the NVSwitch fabric." (emphasis mine)
cuMemImportFromShareableHandle() docs
cuMemCreate() docs