[BUG] Batched nn-descent UMAP unexpectedly throws OOM error on dataset that should succeed with UVM #6204
Comments
This happened to me as well, on an EC2 p4d.24xlarge instance (1.1 TB of RAM, A100 with 40 GB of VRAM): one case works, the other hits the OOM (logs collapsed in the original comment).
It looks like there are two issues that happen at different thresholds (different numbers of rows). The first is a series of integer overflows in the Lanczos solver; rapidsai/raft#2536 should solve it. The second appears in RAFT's sparse-matrix utilities (COO symmetrization): the required nnz is larger than what can be held with 32-bit indices. Solving this requires additional work, namely storing the indices on 64 bits and then modifying the kernels to account for the change. Changes in performance and VRAM usage are to be expected. It might be a good idea to have two pathways (small vs. large number of rows).
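For a rough sense of scale, here is a minimal back-of-the-envelope sketch (plain NumPy, not RAFT code) of how the symmetrized graph's nnz outgrows 32-bit indices; the row and neighbor counts below are assumptions chosen only to illustrate the overflow.

```python
# Minimal sketch (not RAFT code): why the symmetrized kNN graph's nnz can
# exceed 32-bit indexing once the row count grows. The row and neighbor
# counts below are illustrative assumptions.
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2,147,483,647

n_rows = 160_000_000  # hypothetical large single-GPU UMAP input
n_neighbors = 15

# The kNN graph stores n_rows * n_neighbors directed edges; COO
# symmetrization (A + A^T) can require up to twice that many entries.
nnz_knn = n_rows * n_neighbors
nnz_symmetrized_upper_bound = 2 * nnz_knn

print(f"kNN nnz:              {nnz_knn:,}")
print(f"symmetrized nnz (<=): {nnz_symmetrized_upper_bound:,}")
print(f"fits in int32?        {nnz_symmetrized_upper_bound <= INT32_MAX}")
```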
Partially answers rapidsai/cuml#6204.

Authors:
- Victor Lafargue (https://github.com/viclafargue)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)
- Micka (https://github.com/lowener)

URL: #2536
Great! Does nightly 25.01 or nightly 25.02 have the fix?
The fix for the integer overflows in the Lanczos solver has been merged for the 25.02 release. Other PRs should follow soon: rapidsai/raft#2541 and #6245.
I just built #6245 on top of rapidsai/raft#2541 and am still seeing an OOM running the above test script. This was on an 80 GB A100. (Error log attached as a collapsed section.)
With managed memory disabled (commenting out the
Batched UMAP with nn-descent enables processing much larger datasets than before on a single GPU.
Some users want to process 100+ GB datasets on a single GPU, which can sometimes still overwhelm GPU memory depending on the UMAP parameters.
Managed memory should be a potential path to enabling these larger workloads.
When I enable RMM Managed Memory for a workload that should be just too big to fit in GPU memory on a machine with 2 TB of CPU RAM and an 80GB H100 GPU, I unexpectedly get an OOM error.
I was surprised to see this, as from watching `top` I don't see CPU memory go anywhere near 2 TB (it peaks at ~300 GB of CPU memory).

Batched nn-descent UMAP uses multiple CPU threads, but this error still occurs if `OMP_NUM_THREADS=1` is set before execution.

This reproduces with both stable cuML 24.12 and nightly 25.02.
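For reference, here is a minimal sketch of the kind of setup described above: enabling RMM managed memory before fitting batched nn-descent UMAP. The dataset shape and the `build_kwds` key are assumptions for illustration, and the exact batching knob may differ across cuML versions.

```python
# Hedged reproduction sketch; dataset shape and build_kwds key are assumed.
import numpy as np
import rmm
from cuml.manifold import UMAP

# Route RMM allocations through managed (unified) memory so that
# oversubscription can spill to host RAM instead of raising a GPU OOM.
rmm.reinitialize(managed_memory=True)

# Assumed dataset: large enough that intermediate graphs exceed GPU VRAM.
X = np.random.rand(100_000_000, 64).astype(np.float32)

umap = UMAP(
    n_neighbors=15,
    n_components=2,
    build_algo="nn_descent",            # batched nn-descent build path
    build_kwds={"nnd_n_clusters": 16},  # assumed knob for number of batches
)
embedding = umap.fit_transform(X)
```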