`ncclP2pImportShareableBuffer()`: `cuMemImportFromShareableHandle()` fails with CUDA failure 101 (#1647)
Comments
Great detective work! 😃 What's the container runtime that you use? If you run something like

A bit of background: typically there's no P2P communication in NCCL between containers. NCCL sees the containers as separate nodes and doesn't even try. So it's interesting that in your case the two containers are in fact recognized as running on a single node -- do they not have separate file system namespaces and such? I can see that at least the hostnames are different... Even if NCCL were to recognize that they are on the same host, it would face the difficulty of passing the memory handle from one process to the other. Normally we use

I wouldn't think that the collision of CUDA device ids would be responsible. You can create that easily by setting CUDA_VISIBLE_DEVICES to a different value on each process, and NCCL supports such scenarios already.
Oh yeah, can you post complete
Scratch that. It's almost certainly an artifact of MNNVL being in use. In such cases we fuse the topologies of individual nodes and make them look like a single, large node. That's what must be going on here...
Thank you @kiskra-nvidia -- there was quite a bit of out-of-band communication about this. We have now confirmed that this is believed to be a known and old problem in the 570.00 CUDA usermode driver (UMD) -- fixed in newer versions. So, that should answer your question (I also reproduced this without NCCL in NVIDIA/k8s-dra-driver-gpu#294).

I have learned by now: for the scenario of single node and single GPU per container, the expectation is that there is a fallback to the multi-node export/import path (basically MNNVL on a single node), and that wasn't working. This has since been fixed, and the system where I observed the error reported above had a too-old driver. Once I confirm that this problem goes away with a newer driver, I will report back. Closing for now.
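As a side note, a minimal sketch (assuming NVML headers are available and linking with `-lnvidia-ml`) for checking the installed user-mode driver version programmatically -- the same string `nvidia-smi` reports -- so it can be compared against the affected 570.00 release:

```c
// Hedged sketch: query the installed NVIDIA user-mode driver version via NVML.
#include <stdio.h>
#include <nvml.h>

int main(void) {
  char version[NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE];
  if (nvmlInit_v2() != NVML_SUCCESS) {
    fprintf(stderr, "nvmlInit_v2 failed\n");
    return 1;
  }
  if (nvmlSystemGetDriverVersion(version, sizeof(version)) == NVML_SUCCESS) {
    printf("NVIDIA driver (UMD) version: %s\n", version);  // e.g. "570.xx.xx"
  }
  nvmlShutdown();
  return 0;
}
```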
Environment
NCCL built from source. NCCL 2.26.2, CUDA 12.8, see
Setup: node-local GPUs, one GPU per process
Specifically:
NCCL communicator established
`ncclCommInitRank()` executed in both processes; communicator created. The handshake succeeded.

Rank 1 output:
Rank 0 output:
Note: they share `commId 0x470f96b16d11fb71` and have MNNVL set to 1. And the qualified reader certainly notes `cudaDev 0` in both cases. These are containerized processes, each seeing one GPU (but different GPUs).

On timings for communicator creation, rank 0's perspective:
and rank 1's perspective:
Looks good, right?
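For context, a minimal sketch (not the actual reproducer) of how two single-GPU processes typically establish such a communicator; the `ncclUniqueId` produced by rank 0 via `ncclGetUniqueId()` is assumed to be exchanged out of band beforehand:

```c
// Hedged sketch: per-process NCCL communicator setup, one GPU per process.
#include <cuda_runtime.h>
#include <nccl.h>

ncclComm_t initComm(int myRank, int nRanks, ncclUniqueId commId) {
  ncclComm_t comm;
  // Each containerized process sees exactly one GPU, so it is always device 0.
  cudaSetDevice(0);
  // commId was produced by rank 0 and distributed to all ranks before this call.
  ncclCommInitRank(&comm, nRanks, commId, myRank);
  return comm;
}
```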
send()-recv() test fails in `ncclP2pImportShareableBuffer()` with CUDA failure `invalid device ordinal`
Next up, there is a `send()` and a matching `recv()` call. Both fail at `transport/p2p.cc` with:

and
Why?
Did a bit of debugging, sharing my insights.
rank setup is OK, I think
There are no obvious mistakes here as far as I can tell:

- `ncclSend()`: we inject rank 1 as the peer argument.
- `ncclRecv()`: we inject rank 0 as the peer argument.

(The call pattern is sketched below.)
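A minimal sketch of that call pattern, assuming the communicator, device buffer, and stream already exist (see the init sketch further up):

```c
// Hedged sketch: rank 0 sends to peer 1, rank 1 receives from peer 0.
#include <cuda_runtime.h>
#include <nccl.h>

void exchange(int myRank, float* devBuf, size_t count,
              ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  if (myRank == 0) ncclSend(devBuf, count, ncclFloat, /*peer=*/1, comm, stream);
  if (myRank == 1) ncclRecv(devBuf, count, ncclFloat, /*peer=*/0, comm, stream);
  // In the report, the lazy P2P transport setup triggered by this exchange is
  // what ends up failing inside ncclP2pImportShareableBuffer().
  ncclGroupEnd();
  cudaStreamSynchronize(stream);
}
```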
I looked a little closer at what really fails.

bad `CUmemFabricHandle` injected?

The log points to Line 277 in p2p.cc.
That is, it's the function `ncclP2pImportShareableBuffer()` that fails, and specifically this CUDA API call errors out:

It does not receive a CUDA device ID as an argument. So, it's not like we're explicitly operating on a wrong CUDA device ID here (as one might think in view of the message 'invalid device ordinal').
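For reference, the prototype as I read it in the CUDA driver API documentation (note that no parameter is a device ordinal):

```c
// From cuda.h, to the best of my knowledge -- parameter names may differ slightly.
CUresult cuMemImportFromShareableHandle(
    CUmemGenericAllocationHandle *handle,        // out: imported allocation handle
    void                         *osHandle,      // in:  shareable handle data (here: a CUmemFabricHandle*)
    CUmemAllocationHandleType     shHandleType); // in:  e.g. CU_MEM_HANDLE_TYPE_FABRIC
```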
The second argument to `cuMemImportFromShareableHandle()` is documented as

So, that refers to a 'remote GPU', I suppose (in this case: the other GPU, managed by the other process). This argument is initialized as:

So, that somehow must refer to 'the other GPU', and that reference must be somehow wrong in this case here.
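To make the mechanics concrete, here is a hedged, simplified sketch of the general cuMem fabric-handle export/import flow (not NCCL's actual code; error handling, VA mapping, and granularity rounding omitted): process A exports a fabric handle for its allocation, the raw handle bytes are shipped to process B, and B imports them with the call that fails here.

```c
#include <cuda.h>
#include <string.h>

// Process A (exporter): allocate physical memory that can be exported as a
// fabric handle, then export it. The returned bytes are sent to the peer.
CUmemFabricHandle exportBuffer(CUmemGenericAllocationHandle *outAlloc,
                               size_t size, int localDev) {
  CUmemAllocationProp prop;
  memset(&prop, 0, sizeof(prop));
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = localDev;                            // local device ordinal
  prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;  // make it fabric-exportable
  // Real code rounds 'size' up via cuMemGetAllocationGranularity() first.
  cuMemCreate(outAlloc, size, &prop, 0);

  CUmemFabricHandle fh;
  cuMemExportToShareableHandle(&fh, *outAlloc, CU_MEM_HANDLE_TYPE_FABRIC, 0);
  return fh;
}

// Process B (importer): this is the call that fails in the report with
// 'invalid device ordinal'.
CUmemGenericAllocationHandle importBuffer(CUmemFabricHandle fh) {
  CUmemGenericAllocationHandle imported;
  cuMemImportFromShareableHandle(&imported, &fh, CU_MEM_HANDLE_TYPE_FABRIC);
  return imported;
}
```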
The third argument, `type`, in our case here is certainly not `CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR`. The other possible types are documented here, and excluding the win32-related types what remains are three (sketched below):
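For reference, a sketch of that handle-type enum as I recall it from cuda.h (CUDA 12.x):

```c
typedef enum CUmemAllocationHandleType_enum {
  CU_MEM_HANDLE_TYPE_NONE                  = 0x0,  // no exportable handle
  CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR = 0x1,  // fd, passed e.g. over a Unix domain socket
  CU_MEM_HANDLE_TYPE_WIN32                 = 0x2,  // Windows only
  CU_MEM_HANDLE_TYPE_WIN32_KMT             = 0x4,  // Windows only
  CU_MEM_HANDLE_TYPE_FABRIC                = 0x8   // CUmemFabricHandle (NVLink fabric / MNNVL)
} CUmemAllocationHandleType;
```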
I looked at the `ncclIpcDesc` type; it is defined as

and `ncclCuDesc` is defined as

So, we're closing in on a 'bad' argument of type `CU_MEM_HANDLE_TYPE_FABRIC`.

Here, I stop my naive follow-the-rabbit-hole debugging, and hope that for one of the NCCL authors it's already quite easy to see what went wrong.
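One extra check that might be worth running at this point: whether the driver/device even reports fabric-handle support. A hedged sketch, assuming a CUDA 12.x toolkit that defines `CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED`:

```c
// Hedged diagnostic sketch: query fabric-handle support before blaming the
// handle contents themselves.
#include <stdio.h>
#include <cuda.h>

int main(void) {
  CUdevice dev;
  int fabricSupported = 0;
  cuInit(0);
  cuDeviceGet(&dev, 0);  // the only visible device in each container
  cuDeviceGetAttribute(&fabricSupported,
                       CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
  printf("fabric handle support reported: %d\n", fabricSupported);
  return 0;
}
```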
Both GPUs are represented by CUDA device index zero in their individual processes.
This is a Kubernetes environment and the two processes are containerized. They act on different GPUs in the same machine. CUDA code perceives them both (individually and independently) as CUDA device 0.
I understand that this might trip up some logic, but I think it shouldn't, right?
Is there logic in NCCL that relies on the idea that different GPUs in the same machine must always have a different CUDA device index? Maybe that's not the case, and the problem is elsewhere.
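To double-check that the two 'device 0's really are different physical GPUs, one can compare a stable identity instead of the ordinal; a minimal sketch using the PCI bus id:

```c
// Hedged sketch: identify "device 0" by its PCI bus id rather than its ordinal.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  int count = 0;
  char busId[64];
  cudaGetDeviceCount(&count);                          // returns 1 in each container
  cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), 0); // differs between the containers
  printf("visible devices: %d, device 0 PCI bus id: %s\n", count, busId);
  return 0;
}
```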
One GPU is:
The other is:
more context
Both processes really have CUDA `getDeviceCount()` return 1. `CUDA_VISIBLE_DEVICES` is not set.

More logs, in this case from rank 0 (the sender):
Great logging, so you've built a custom stack trace here.. :-)
This shows that what fails is `ncclTransportP2pSetup() -> p2pSendConnect() -> p2pMap() -> ncclP2pImportShareableBuffer() -> cuMemImportFromShareableHandle()`.

Further debug info
Let me know what I can do to further help here. And thanks for the great work on NCCL.