ncclP2pImportShareableBuffer(): cuMemImportFromShareableHandle() fails with CUDA failure 101 #1647

Closed
jgehrcke opened this issue Mar 19, 2025 · 4 comments

@jgehrcke

jgehrcke commented Mar 19, 2025

Environment

NCCL built from source. NCCL 2.26.2, CUDA 12.8, see

[2025-03-18 16:45:53] nickelpie-job-1:1:1 [0] NCCL INFO cudaDriverVersion 12080
[2025-03-18 16:45:53] nickelpie-job-1:1:1 [0] NCCL INFO NCCL version 2.26.2+cuda12.8

Setup: node-local GPUs, one GPU per process

Specifically:

  • One machine/node
  • 2 different processes, each operating on one GPU

NCCL communicator established

ncclCommInitRank() was executed in both processes and the communicator was created. The handshake succeeded.
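For context, communicator creation here follows the standard out-of-band bootstrap pattern. This is only a minimal sketch (not the actual application code); exchange_id() is a hypothetical placeholder for however the unique id travels from rank 0 to rank 1:

// Hedged sketch of the two-process bootstrap; exchange_id() is a hypothetical
// helper standing in for whatever out-of-band channel (TCP, shared storage, ...)
// ships the ncclUniqueId from rank 0 to rank 1.
#include <cuda_runtime.h>
#include <nccl.h>

ncclComm_t init_comm(int rank, int nranks) {
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);  // rank 0 generates the id
  exchange_id(&id, rank);               // both processes end up with the same id
  cudaSetDevice(0);                     // each containerized process sees only device 0
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);
  return comm;
}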

Rank 1 output:

[2025-03-18 21:25:52] nickelpie-job-1:1:1 [0] NCCL INFO comm 0x1ba15f20 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 1
...
[2025-03-18 21:25:52] nickelpie-job-1:1:1 [0] NCCL INFO ncclCommInitRank comm 0x1ba15f20 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x470f96b16d11fb71 - Init COMPLETE

Rank 0 output:

[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO comm 0x222431b0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 1
...
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO ncclCommInitRank comm 0x222431b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1901000 commId 0x470f96b16d11fb71 - Init COMPLETE

Note: they share commId 0x470f96b16d11fb71 and have MNNVL set to 1. The attentive reader will also note cudaDev 0 in both cases: these are containerized processes, each seeing one GPU (but different GPUs).

Timings for communicator creation, from rank 0's perspective:

[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.88 (kernels 0.15, alloc 0.01, bootstrap 0.63, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.06, rest 0.02)

and rank 1's perspective:

[2025-03-18 21:25:52] nickelpie-job-1:1:1 [0] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 2 total 0.22 (kernels 0.11, alloc 0.01, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.06, rest 0.02)

Looks good, right?

send()/recv() test fails in ncclP2pImportShareableBuffer() with CUDA failure 'invalid device ordinal'

Next up, there is a send() and a matching recv() call. Both fail at transport/p2p.cc with:

nickelpie-job-1:1:89 [0] transport/p2p.cc:277 NCCL WARN Cuda failure 101 'invalid device ordinal'

and

nickelpie-job-0:1:92 [0] transport/p2p.cc:277 NCCL WARN Cuda failure 101 'invalid device ordinal'

Why?

I did a bit of debugging; sharing my insights below.

rank setup is OK, I think

There are no obvious mistakes here as far as I can tell:

  • rank 0 is the sender; upon ncclSend() we pass rank 1 as the peer argument.
  • rank 1 is the receiver; upon ncclRecv() we pass rank 0 as the peer argument.
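A minimal sketch of that call pattern (buffer management and stream handling in the real application are omitted; sendbuf, recvbuf, count and stream are placeholders):

// Sketch of the matching send/recv pair; sendbuf, recvbuf, count and stream
// are placeholders for the application's actual device buffers and CUDA stream.
if (rank == 0) {
  ncclSend(sendbuf, count, ncclFloat, /*peer=*/1, comm, stream);
} else {  // rank == 1
  ncclRecv(recvbuf, count, ncclFloat, /*peer=*/0, comm, stream);
}
cudaStreamSynchronize(stream);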

I looked a little closer at what really fails.

bad CUmemFabricHandle injected?

The log points to Line 277 in p2p.cc.

That is, it's the function ncclP2pImportShareableBuffer() that fails, and specifically this CUDA API call which errors out:

CUCHECK(cuMemImportFromShareableHandle(&handle, cuDesc, type));

This call does not take a CUDA device ID as an argument. So it's not that we're explicitly operating on a wrong CUDA device ID here (as one might think given the message 'invalid device ordinal').

The second argument to cuMemImportFromShareableHandle() is documented as

Shareable Handle representing the memory allocation that is to be imported

So, that refers to a 'remote GPU', I suppose (in this case: the other GPU, managed by the other process). This argument is initialized as:

ncclCuDesc *cuDesc = &ipcDesc->cuDesc;

So this descriptor must somehow refer to 'the other GPU', and in our case that reference must somehow be bad.

The third argument, type, in our case is certainly not CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR. The other possible types are documented in the CUDA driver API reference; excluding the Win32-related types, three remain:

CU_MEM_HANDLE_TYPE_NONE = 0x0
CU_MEM_HANDLE_TYPE_FABRIC = 0x8
CU_MEM_HANDLE_TYPE_MAX = 0x7FFFFFFF

I looked at the ncclIpcDesc type; it is defined as

typedef union {
  // Legacy CUDA IPC
  cudaIpcMemHandle_t devIpc;
  // cuMem API support
  struct {
    ncclCuDesc cuDesc;
    CUmemGenericAllocationHandle memHandle;
  };
} ncclIpcDesc;

and ncclCuDesc is defined as

typedef union {
  uint64_t data; // Needs to hold a CUmemGenericAllocationHandle for UDS fd support
  CUmemFabricHandle handle;
} ncclCuDesc;

So, we're closing in on a 'bad' argument of type CU_MEM_HANDLE_TYPE_FABRIC.
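To make the failing path concrete: on the fabric-handle path, the exporting process allocates with cuMemCreate() requesting CU_MEM_HANDLE_TYPE_FABRIC, exports a CUmemFabricHandle via cuMemExportToShareableHandle(), ships those opaque bytes to the peer over a socket, and the importing process calls cuMemImportFromShareableHandle(). Below is a hedged reconstruction of the importer side (my sketch, not NCCL's exact code; recv_fabric_handle() is a hypothetical helper for receiving the exported bytes):

// Hedged reconstruction of the importer side of the fabric-handle path.
// recv_fabric_handle() is a hypothetical helper that receives the bytes
// exported by the peer via cuMemExportToShareableHandle().
#include <cuda.h>

CUresult import_peer_buffer(void **peer_ptr, size_t size) {
  CUmemFabricHandle fh;
  recv_fabric_handle(&fh);   // opaque 64-byte handle from the exporting process

  CUmemGenericAllocationHandle handle;
  // This is the call that fails with CUDA error 101 ('invalid device ordinal') here:
  CUresult res = cuMemImportFromShareableHandle(&handle, &fh,
                                                CU_MEM_HANDLE_TYPE_FABRIC);
  if (res != CUDA_SUCCESS) return res;

  // On success, the imported allocation would then be mapped into this
  // process's address space (access flags via cuMemSetAccess omitted):
  CUdeviceptr dptr;
  cuMemAddressReserve(&dptr, size, /*alignment=*/0, /*addr=*/0, /*flags=*/0);
  cuMemMap(dptr, size, /*offset=*/0, handle, /*flags=*/0);
  *peer_ptr = (void *)dptr;
  return CUDA_SUCCESS;
}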

Here I stop my naive follow-the-rabbit-hole debugging, hoping that one of the NCCL authors can already see quite easily what went wrong.

Both GPUs are represented by CUDA device index zero in their individual processes.

This is a Kubernetes environment and the two processes are containerized. They act on different GPUs in the same machine, but CUDA code in each process perceives its GPU (individually and independently) as CUDA device 0.

I understand that this might trip up some logic, but I think it shouldn't, right?

Is there logic in NCCL that relies on the idea that different GPUs in the same machine must always have a different CUDA device index? Maybe that's not the case, and the problem is elsewhere.

One GPU is:

250318-21:25:51.321 INFO: <CUDA Device 0> properties:
{'major': 9,
 'minor': 0,
 'multiGpuBoardGroupID': 0,
 'multiProcessorCount': 132,
 'name': b'NVIDIA GH200 96GB HBM3',
 'pciBusID': 1,
 'pciDeviceID': 0,
 'pciDomainID': 25,
 'uuid': UUID('ac687854-a9a8-1692-6388-97238a7e6923')}

The other is:

250318-21:25:52.025 INFO: <CUDA Device 0> properties:
{'major': 9,
 'minor': 0,
 'multiGpuBoardGroupID': 0,
 'multiProcessorCount': 132,
 'name': b'NVIDIA GH200 96GB HBM3',
 'pciBusID': 1,
 'pciDeviceID': 0,
 'pciDomainID': 9,
 'uuid': b'\xaf\x06\x84\xa1\xa4\x07\x15S\x0e\xc4u\x06\t\xf5\x96'}

more context

250318-21:25:52.022 INFO: env: NV_CUDA_CUDART_VERSION: 12.8.90-1
250318-21:25:52.023 INFO: env: CUDA_VERSION: 12.8.1
250318-21:25:52.023 INFO: env: NV_CUDA_LIB_VERSION: 12.8.1-1
250318-21:25:52.023 INFO: env: NV_CUDA_CUDART_DEV_VERSION: 12.8.90-1
250318-21:25:52.023 INFO: env: NV_CUDA_NSIGHT_COMPUTE_VERSION: 12.8.1-1

In both processes, CUDA's getDeviceCount() really returns 1.

CUDA_VISIBLE_DEVICES is not set.
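For completeness, a minimal sketch (CUDA runtime API; not the script that produced the property dumps above) of how each process can confirm that it sees exactly one device at index 0, while UUID and PCI location show that the two processes hold different physical GPUs:

// Minimal sketch: both processes report device count 1 and device index 0,
// but UUID and PCI location identify two different physical GPUs.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  int n = 0;
  cudaGetDeviceCount(&n);   // returns 1 inside each container
  printf("device count: %d\n", n);

  struct cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("name=%s pciDomainID=%d pciBusID=%d pciDeviceID=%d uuid=",
         prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
  for (int i = 0; i < 16; i++) printf("%02x", (unsigned char)prop.uuid.bytes[i]);
  printf("\n");
  return 0;
}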

More logs, in this case from rank 0 (the sender):

[2025-03-18 21:25:51] nickelpie-job-0:1:1 [0] NCCL INFO Bootstrap: Using eth0:192.168.33.51<0>
[2025-03-18 21:25:51] nickelpie-job-0:1:1 [0] NCCL INFO cudaDriverVersion 12080
[2025-03-18 21:25:51] nickelpie-job-0:1:1 [0] NCCL INFO NCCL version 2.26.2+cuda12.8
[2025-03-18 21:25:51] nickelpie-job-0:1:1 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
[2025-03-18 21:25:51] nickelpie-job-0:1:1 [0] NCCL INFO Failed to open libibverbs.so[.1]
[2025-03-18 21:25:51] nickelpie-job-0:1:1 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.33.51<0>
[2025-03-18 21:25:51] nickelpie-job-0:1:1 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
[2025-03-18 21:25:51] nickelpie-job-0:1:1 [0] NCCL INFO Using network Socket
[2025-03-18 21:25:51] nickelpie-job-0:1:1 [0] NCCL INFO ncclCommInitRank comm 0x222431b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1901000 commId 0x470f96b16d11fb71 - Init START
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO RAS client listening socket at ::1<28028>
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO Bootstrap timings total 0.633247 (create 0.000033, send 0.000110, recv 0.632491, ring 0.000067, delay 0.000002)
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO MNNVL busId 0x1901000 fabric UUID 834c74bda8d3be1f.493bc5eb7cfcbaf cliqueId 0x1 state 3 healthMask 0x2
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO NCCL_MNNVL_ENABLE set by environment to 1.
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO MNNVL 1 cliqueId 1 cliqueSize 2 cliqueRank 0
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ffffffff,ffffff00,00000000,00000000
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO comm 0x222431b0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 1

...
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO Channel 10/24 : 0 1
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO Channel 11/24 : 0 1
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO Channel 12/24 : 0 1
...

[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO P2P Chunksize set to 524288
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 1 directMode 0
[2025-03-18 21:25:52] nickelpie-job-0:1:90 [0] NCCL INFO [Proxy Service] Device 0 CPU core 112
[2025-03-18 21:25:52] nickelpie-job-0:1:91 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 84
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO CC Off, workFifoBytes 1048576
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO ncclCommInitRank comm 0x222431b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1901000 commId 0x470f96b16d11fb71 - Init COMPLETE
[2025-03-18 21:25:52] nickelpie-job-0:1:1 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.88 (kernels 0.15, alloc 0.01, bootstrap 0.63, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.06, rest 0.02)
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[0] via P2P/MNNVL
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[0] via P2P/MNNVL

...
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO Channel 30/1 : 0[0] -> 1[0] via P2P/MNNVL
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO Channel 31/1 : 0[0] -> 1[0] via P2P/MNNVL
...
nickelpie-job-0:1:92 [0] transport/p2p.cc:277 NCCL WARN Cuda failure 101 'invalid device ordinal'
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO transport/p2p.cc:352 -> 1
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO transport/p2p.cc:489 -> 1
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO transport.cc:197 -> 1
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO group.cc:133 -> 1
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO group.cc:75 -> 1 [Async thread]
[2025-03-18 21:25:59] nickelpie-job-0:1:1 [0] NCCL INFO group.cc:422 -> 1
[2025-03-18 21:25:59] nickelpie-job-0:1:1 [0] NCCL INFO group.cc:581 -> 1
[2025-03-18 21:25:59] nickelpie-job-0:1:1 [0] NCCL INFO enqueue.cc:2299 -> 1
[2025-03-18 21:25:59] nickelpie-job-0:1:93 [0] NCCL INFO misc/socket.cc:64 -> 3
[2025-03-18 21:25:59] nickelpie-job-0:1:93 [0] NCCL INFO misc/socket.cc:80 -> 3
[2025-03-18 21:25:59] nickelpie-job-0:1:93 [0] NCCL INFO misc/socket.cc:829 -> 3
[2025-03-18 21:25:59] nickelpie-job-0:1:90 [0] NCCL INFO misc/socket.cc:881 -> 3
[2025-03-18 21:26:00] nickelpie-job-0:1:93 [0] NCCL INFO comm 0x222431b0 rank 0 nranks 2 cudaDev 0 busId 1901000 - Abort COMPLETE

Great logging; you've essentially built a custom stack trace here. :-)

[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO transport/p2p.cc:352 -> 1
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO transport/p2p.cc:489 -> 1
[2025-03-18 21:25:59] nickelpie-job-0:1:92 [0] NCCL INFO transport.cc:197 -> 1

This shows that what fails is ncclTransportP2pSetup() -> p2pSendConnect() -> p2pMap() -> ncclP2pImportShareableBuffer() -> cuMemImportFromShareableHandle().

Further debug info

Let me know what I can do to further help here. And thanks for the great work on NCCL.

@kiskra-nvidia
Member

Great detective work! 😃

What's the container runtime that you use? If you run something like nvidia-smi topo -m inside each container, are both GPUs visible? Could you check using nvbandwidth if you are able to establish communication between them?

A bit of background: typically there's no P2P communication in NCCL between containers. NCCL sees the containers as separate nodes and doesn't even try. So it's interesting that in your case the two containers are in fact recognized as running on a single node -- do they not have separate file system namespaces and such? I can see that at least the hostnames are different...

Even if NCCL were to recognize that they are on the same host, it would face the difficulty of passing the memory handle from one process to the other. Normally we use CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR handles that we pass via UNIX domain sockets, but with separate file system namespaces between containers, there's no way to establish one. On MNNVL systems, however, we switch to the fabric handles, which can be passed around via TCP/IP sockets.

Truth be told, I've wondered for some time if in a case such as yours we could have P2P communication cross the container boundary. You might be the first person we know to have tried it, and... it doesn't work? Right now though we don't know if it's a bug or an inherent limitation, and whether it's in NCCL or CUDA.

I wouldn't think that the collision of CUDA device IDs would be responsible. You can create such a collision easily by setting CUDA_VISIBLE_DEVICES to a different value in each process, and NCCL already supports such scenarios.

@kiskra-nvidia
Member

Oh yeah, can you post complete NCCL_DEBUG=INFO logs from both ranks? I guess we have one from rank 0 already...

@kiskra-nvidia
Member

So it's interesting that in your case the two containers are in fact recognized as running on a single node

Scratch that. It's almost certainly an artifact of MNNVL being in use. In such cases we fuse the topologies of individual nodes and make them look like a single, large node. That's what must be going on here...

@jgehrcke
Author

jgehrcke commented Mar 25, 2025

Thank you @kiskra-nvidia -- there was quite a bit of out-of-band communication about this. We have now confirmed that this is believed to be a known and old problem in the 570.00 CUDA usermode driver (UMD) -- fixed in newer versions.

So, that should answer your question:

Right now though we don't know if it's a bug or an inherent limitation, and whether it's in NCCL or CUDA.

(also reproduced this without NCCL in NVIDIA/k8s-dra-driver-gpu#294).

I have learned by now: for the single-node, single-GPU-per-container scenario, the expectation is that there is a fallback to the multi-node export/import path (basically MNNVL on a single node), and that wasn't working. This has since been fixed; the system where I observed the error reported above had a too-old driver.

Once I confirm that this problem goes away with a newer driver, I will report back. Closing for now.
