
MNNVL: single-node IMEX-based handle transfer fails #294

Open
jgehrcke opened this issue Mar 25, 2025 · 0 comments

Scenario: sharing a CU_MEM_HANDLE_TYPE_FABRIC handle between two isolated GPUs via IMEX

  • Two containerized processes. Each process has cgroup access to just one individual GPU (two physically different GPUs are used overall).
  • The two processes have a shared IMEX channel (both containers have cgroup access to the same IMEX channel).
  • One process uses the VMM API call cuMemCreate() and then exports a handle of type CU_MEM_HANDLE_TYPE_FABRIC via cuMemExportToShareableHandle(). The handle data is transferred to the other process via network communication. The other process tries to import it via cuMemImportFromShareableHandle(). This sequence is at the heart of MNNVL; the export side is sketched right after this list.
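
For illustration, a minimal sketch of the export side in C against the CUDA driver API (the actual repro is Python and linked further down; device selection, allocation size, mapping and error handling are simplified, so treat this as a sketch rather than the repro itself):

```c
/* Export-side sketch: allocate device memory with a fabric-exportable handle
 * and export the opaque CUmemFabricHandle bytes for network transfer.
 * Assumes: CUDA 12.x driver API, one visible GPU, IMEX channel present. */
#include <cuda.h>
#include <stdio.h>
#include <string.h>

#define CHECK(call) do { CUresult _r = (call); if (_r != CUDA_SUCCESS) { \
    const char *_s = NULL; cuGetErrorName(_r, &_s); \
    fprintf(stderr, "%s failed: %s (%d)\n", #call, _s ? _s : "?", (int)_r); \
    return 1; } } while (0)

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* Request an allocation that can be exported as a fabric handle. */
    CUmemAllocationProp prop;
    memset(&prop, 0, sizeof(prop));
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = (int)dev;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    size_t size = 0;
    CHECK(cuMemGetAllocationGranularity(&size, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    CUmemGenericAllocationHandle alloc;
    CHECK(cuMemCreate(&alloc, size, &prop, 0));

    /* The opaque handle bytes are what the exporting process sends to the
     * importing process over the network. */
    CUmemFabricHandle fh;
    CHECK(cuMemExportToShareableHandle(&fh, alloc, CU_MEM_HANDLE_TYPE_FABRIC, 0));

    /* ... transfer fh.data (sizeof(fh.data) bytes) to the peer process ... */
    return 0;
}
```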

1-node setup breaks, 2-node setup works

I worked on a repro that orchestrates this scenario with the DRA Driver 25.3.0-rc.1 for two different cases on a GH200 system:

  • both containers run on the same node
  • the two containers run on two different nodes

In both cases, the IMEX daemon setup is done by the DRA driver (via the ComputeDomain primitive), and the application code is the same. The two cases are kept as similar as possible: the (only) difference is the node count.

Reproduction code and logs and more environment details are at https://github.com/jgehrcke/jpsnips-nv/tree/main/repros/imex-1node-fabric-hdl-import101.

The 1-node setup breaks, the 2-node setup works.

Specifically:

  • Handle import works when these two processes run on two different nodes (shared IMEX channel, 2-node healthy IMEX daemon setup managed by the DRA driver). Repro 2-nodes-works.sh demonstrates this.
  • Handle import fails with code 101 (CUDA_ERROR_INVALID_DEVICE) when the two processes run on the same node (shared IMEX channel, 1-node healthy IMEX daemon setup, also managed by the DRA driver). Repro 1-node-breaks.sh demonstrates this. The import side is sketched below.
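
For reference, a corresponding import-side sketch (C, CUDA driver API; assumes the handle bytes were received over the network and that a CUDA context for the single visible GPU is current):

```c
/* Import-side sketch. 'fh' holds the CUmemFabricHandle bytes received from
 * the exporting process; a context on the local GPU must be current. */
#include <cuda.h>
#include <stdio.h>

static int import_fabric_handle(CUmemFabricHandle *fh,
                                CUmemGenericAllocationHandle *out) {
    CUresult r = cuMemImportFromShareableHandle(out, (void *)fh,
                                                CU_MEM_HANDLE_TYPE_FABRIC);
    if (r != CUDA_SUCCESS) {
        /* In the 1-node case described above this is where error 101
         * (CUDA_ERROR_INVALID_DEVICE) shows up; in the 2-node case the
         * same call succeeds. */
        const char *name = NULL;
        cuGetErrorName(r, &name);
        fprintf(stderr, "cuMemImportFromShareableHandle: %s (%d)\n",
                name ? name : "?", (int)r);
        return -1;
    }
    return 0;
}
```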

I first saw this in the context of NCCL (NVIDIA/nccl#1647), and then reduced this to a repro without NCCL, just using raw CUDA API, see https://github.com/jgehrcke/jpsnips-nv/blob/3e6176f4e46cdabeedf76054a8bb09a1aa7179ba/repros/imex-1node-fabric-hdl-import101/fabric-handle-transfer-test.py#L187.

"Single-node MNNVL" should work (right?)

Funny term and interesting philosophical question: should "Single-node MNNVL" work? It certainly should in the long run. But is it expected to work today? Seemingly yes-ish.

Specifically, it is seemingly expected that a handle of type CU_MEM_HANDLE_TYPE_FABRIC exported on one GPU of a node can be imported on a different GPU of the same node, even and especially with container isolation. Quotes from out-of-band communication:

For devices that we can't find in either exposed devices (or invisible devices) we assume that it's a multi node import so its likely that the device is hidden from us and we assume the allocation is a multi node import.

and

multinode path should work once you detect and fallback

Problem in NVIDIA driver or in container/k8s stack?

The error is returned by cuMemImportFromShareableHandle(). The overall methodology (API usage, IMEX setup) seems to be correct, which suggests a problem in the CUDA runtime/driver.

However, potentially the problem is in 'our' stack (GPU Operator, DRA driver, container runtime, ...).

Goal: narrow down where the problem is and enable the use case of single-node IMEX-based handle transfer.
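
One way to help narrow this down from inside the importing container is to ask the driver whether it reports fabric handle support for the visible device at all. A minimal sketch, assuming the CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED device attribute from recent CUDA 12.x headers:

```c
/* Diagnostic sketch: query whether the driver reports fabric handle support
 * for device 0. A value of 0 would point at the driver/IMEX setup rather
 * than at the import call itself. Assumes the
 * CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED attribute (CUDA 12.x). */
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    int fabric_supported = 0;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDeviceGetAttribute(&fabric_supported,
                         CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
    printf("fabric handle support on device 0: %d\n", fabric_supported);
    return 0;
}
```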

A note on the single-node setup

Orchestrating the single-node setup with the DRA driver 25.3.0 requires manually setting up a ResourceClaim (instead of using a resource claim template associated with a compute domain). That is an interesting limitation that I will describe (for the record) elsewhere.

Related resources

https://docs.nvidia.com/cuda/cuda-driver-api/structCUmemFabricHandle__v1.html#structCUmemFabricHandle__v1

An opaque handle representing a memory allocation that can be exported to processes in same or different nodes. For IPC between processes on different nodes they must be connected via the NVSwitch fabric.

(emphasis mine)

cuMemImportFromShareableHandle() docs

cuMemCreate() docs
