
MNNVL: single-node IMEX-based handle transfer fails #294

Open
jgehrcke opened this issue Mar 25, 2025 · 0 comments

Scenario: sharing a CU_MEM_HANDLE_TYPE_FABRIC handle between two isolated GPUs via IMEX

  • Two containerized processes. Each process has cgroup access to just one individual GPU (two physically different GPUs are used overall).
  • The two processes have a shared IMEX channel (both containers have cgroup access to the same IMEX channel).
  • One process uses the VMM API call cuMemCreate() and then exports a handle of type CU_MEM_HANDLE_TYPE_FABRIC via cuMemExportToShareableHandle(). The handle data is transferred to the other process via network communication. The other process tries to import it via cuMemImportFromShareableHandle(). This sequence is at the heart of MNNVL; the export side is sketched right after this list.
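
For illustration, a minimal sketch of the export side in C against the CUDA driver API (the actual repro is Python and linked further down; device selection, allocation size, mapping and error handling are simplified, so treat this as a sketch rather than the repro itself):

```c
/* Export-side sketch: allocate device memory with a fabric-exportable handle
 * and export the opaque CUmemFabricHandle bytes for network transfer.
 * Assumes: CUDA 12.x driver API, one visible GPU, IMEX channel present. */
#include <cuda.h>
#include <stdio.h>
#include <string.h>

#define CHECK(call) do { CUresult _r = (call); if (_r != CUDA_SUCCESS) { \
    const char *_s = NULL; cuGetErrorName(_r, &_s); \
    fprintf(stderr, "%s failed: %s (%d)\n", #call, _s ? _s : "?", (int)_r); \
    return 1; } } while (0)

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* Request an allocation that can be exported as a fabric handle. */
    CUmemAllocationProp prop;
    memset(&prop, 0, sizeof(prop));
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = (int)dev;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    size_t size = 0;
    CHECK(cuMemGetAllocationGranularity(&size, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    CUmemGenericAllocationHandle alloc;
    CHECK(cuMemCreate(&alloc, size, &prop, 0));

    /* The opaque handle bytes are what the exporting process sends to the
     * importing process over the network. */
    CUmemFabricHandle fh;
    CHECK(cuMemExportToShareableHandle(&fh, alloc, CU_MEM_HANDLE_TYPE_FABRIC, 0));

    /* ... transfer fh.data (sizeof(fh.data) bytes) to the peer process ... */
    return 0;
}
```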

1-node setup breaks, 2-node setup works

I worked on a repro that orchestrates this scenario with the DRA Driver 25.3.0-rc.1 for two different cases on a GH200 system:

  • both containers run on the same node
  • the two containers run on two different nodes

In both cases, the IMEX daemon setup is done by the DRA driver (via the ComputeDomain primitive), and the application code is the same. The two cases are kept as similar as possible: the (only) difference is the node count.

Reproduction code and logs and more environment details are at https://github.com/jgehrcke/jpsnips-nv/tree/main/repros/imex-1node-fabric-hdl-import101.

The 1-node setup breaks, the 2-node setup works.

Specifically:

  • Handle import works when these two processes run on two different nodes (shared IMEX channel, 2-node healthy IMEX daemon setup managed by the DRA driver). Repro 2-nodes-works.sh demonstrates this.
  • Handle import fails with code 101 (CUDA_ERROR_INVALID_DEVICE) when the two processes run on the same node (shared IMEX channel, 1-node healthy IMEX daemon setup, also managed by the DRA driver). Repro 1-node-breaks.sh demonstrates this. The import side is sketched below.
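
For reference, a corresponding import-side sketch (C, CUDA driver API; assumes the handle bytes were received over the network and that a CUDA context for the single visible GPU is current):

```c
/* Import-side sketch. 'fh' holds the CUmemFabricHandle bytes received from
 * the exporting process; a context on the local GPU must be current. */
#include <cuda.h>
#include <stdio.h>

static int import_fabric_handle(CUmemFabricHandle *fh,
                                CUmemGenericAllocationHandle *out) {
    CUresult r = cuMemImportFromShareableHandle(out, (void *)fh,
                                                CU_MEM_HANDLE_TYPE_FABRIC);
    if (r != CUDA_SUCCESS) {
        /* In the 1-node case described above this is where error 101
         * (CUDA_ERROR_INVALID_DEVICE) shows up; in the 2-node case the
         * same call succeeds. */
        const char *name = NULL;
        cuGetErrorName(r, &name);
        fprintf(stderr, "cuMemImportFromShareableHandle: %s (%d)\n",
                name ? name : "?", (int)r);
        return -1;
    }
    return 0;
}
```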

I first saw this in the context of NCCL (NVIDIA/nccl#1647), and then reduced this to a repro without NCCL, just using raw CUDA API, see https://github.com/jgehrcke/jpsnips-nv/blob/3e6176f4e46cdabeedf76054a8bb09a1aa7179ba/repros/imex-1node-fabric-hdl-import101/fabric-handle-transfer-test.py#L187.

"Single-node MNNVL" should work (right?)

Funny term and interesting philosophical question: should "Single-node MNNVL" work? It certainly should in the long run. But is it expected to work today? Seemingly yes-ish.

Specifically, it is seemingly expected that a handle of type CU_MEM_HANDLE_TYPE_FABRIC exported on one GPU of a node can be imported on a different GPU of the same node, even and especially with container isolation. Quotes from out-of-band communication:

For devices that we can't find in either exposed devices (or invisible devices) we assume that it's a multi node import so its likely that the device is hidden from us and we assume the allocation is a multi node import.

and

multinode path should work once you detect and fallback

Problem in NVIDIA driver or in container/k8s stack?

The error is returned by cuMemImportFromShareableHandle(). The overall methodology (API usage, IMEX setup) seems to be correct, which suggests a problem in the CUDA runtime/driver.

However, potentially the problem is in 'our' stack (GPU Operator, DRA driver, container runtime, ...).

Goal: narrow down where the problem is and enable the use case of single-node IMEX-based handle transfer.
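
One way to help narrow this down from inside the importing container is to ask the driver whether it reports fabric handle support for the visible device at all. A minimal sketch, assuming the CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED device attribute from recent CUDA 12.x headers:

```c
/* Diagnostic sketch: query whether the driver reports fabric handle support
 * for device 0. A value of 0 would point at the driver/IMEX setup rather
 * than at the import call itself. Assumes the
 * CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED attribute (CUDA 12.x). */
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    int fabric_supported = 0;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDeviceGetAttribute(&fabric_supported,
                         CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
    printf("fabric handle support on device 0: %d\n", fabric_supported);
    return 0;
}
```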

A note on the single-node setup

Orchestrating the single-node setup with the DRA driver 25.3.0 requires manually setting up a ResourceClaim (instead of using a resource claim template associated with a compute domain). That is an interesting limitation that I will describe (for the record) elsewhere.

Related resources

https://docs.nvidia.com/cuda/cuda-driver-api/structCUmemFabricHandle__v1.html#structCUmemFabricHandle__v1

An opaque handle representing a memory allocation that can be exported to processes in same or different nodes. For IPC between processes on different nodes they must be connected via the NVSwitch fabric.

(emphasis mine)

cuMemImportFromShareableHandle() docs

cuMemCreate() docs
