[BUG] Batched nn-descent UMAP unexpectedly throws OOM error on dataset that should succeed with UVM #6204

Open
beckernick opened this issue Jan 3, 2025 · 6 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working)

@beckernick (Member) commented Jan 3, 2025

Batched UMAP with nn-descent enables processing much larger datasets than before on a single GPU.

Some users want to process 100+ GB datasets on a single GPU, which can sometimes still overwhelm GPU memory depending on the UMAP parameters.

Managed memory should be a potential path to enabling these larger workloads.

When I enable RMM Managed Memory for a workload that should be just too big to fit in GPU memory on a machine with 2 TB of CPU RAM and an 80GB H100 GPU, I unexpectedly get an OOM error.

I was surprised to see this: watching top, CPU memory usage never goes anywhere near 2 TB (it peaks at roughly 300 GB).

Batched nn-descent UMAP uses multiple CPU threads, but this error still occurs if OMP_NUM_THREADS=1 is set before execution.
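The OMP_NUM_THREADS=1 check was done by setting the variable in the environment before launching the script; an approximately equivalent approach (an assumption on my part, not part of the original reproducer below) is to set it from Python before importing cuml:

import os

# Assumption: OpenMP reads OMP_NUM_THREADS when its runtime initializes,
# so setting it here, before cuml/raft are imported and used, should have
# the same effect as exporting it in the shell.
os.environ["OMP_NUM_THREADS"] = "1"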

import gc

import numpy as np
import rmm

# Route all RMM device allocations through CUDA managed (unified) memory,
# which can oversubscribe GPU memory and migrate pages to/from host RAM.
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

from cuml.manifold import UMAP

def do_umap(data, n_components=2, n_clusters=4, data_on_host=True):
    reducer = UMAP(
        n_components=n_components,
        build_algo="nn_descent",
        build_kwds={"nnd_n_clusters": n_clusters},
    )
    # Fit and transform the data
    embeddings = reducer.fit_transform(data, data_on_host=data_on_host)
    del embeddings
    del reducer
    gc.collect()

# 40e6 x 768, UMAP n_components=32, n_clusters=10 succeeds on the H100 80GB GPU without UVM
# 45e6 x 768, UMAP n_components=32, n_clusters=10 cannot succeed with or without UVM
N = int(45e6)
K = 768

rng = np.random.default_rng(seed=12)
synthetic_data = rng.random((N, K), dtype="float32")

print("Finished generating data")

print("Starting UMAP")
do_umap(synthetic_data, n_components=32, n_clusters=10)
print("Done")
$ CUDA_VISIBLE_DEVICES=7 python umap-uvm-oom-reproducer.py

Finished generating data
Starting UMAP

Traceback (most recent call last):
  File "/raid/nicholasb/umap-uvm-oom-reproducer.py", line 34, in <module>
    do_umap(synthetic_data, n_components=32, n_clusters=10)
  File "/raid/nicholasb/umap-uvm-oom-reproducer.py", line 18, in do_umap
    embeddings = reducer.fit_transform(data, data_on_host=data_on_host)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 188, in
 wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 393, in
 dispatch
return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "base.pyx", line 720, in cuml.internals.base.UniversalBase.dispatch_func
  File "umap.pyx", line 741, in cuml.manifold.umap.UMAP.fit_transform
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "base.pyx", line 720, in cuml.internals.base.UniversalBase.dispatch_func
  File "umap.pyx", line 678, in cuml.manifold.umap.UMAP.fit
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /tmp/pip-build-env-m_qegmhe/normal/lib/python3.11/site-packages/librmm/include/rmm/mr/device/managed_memory_resource.hpp:66: cudaErrorMemoryAllocation out of memory

This reproduces with both the stable cuML 24.12 and nightly 25.02.

@beckernick added the "? - Needs Triage" and "bug" labels on Jan 3, 2025
@pablete commented Jan 3, 2025

This happened to me as well, on an EC2 p4d.24xlarge instance (1.1 TB of RAM, A100 with 40 GB of VRAM).
I am using vectors of dimension 1280. I think roughly 45 million rows is where it breaks:

Works!
N = int(40e6)
K = 1280

OOM!
N = int(45e6)
K = 1280

@viclafargue (Contributor) commented

It looks like there are two issues that occur at different thresholds (different numbers of rows). The first is a series of integer overflows in the Lanczos solver; rapidsai/raft#2536 should solve it. The second appears in RAFT's sparse matrix utilities (COO symmetrization): the required nnz is larger than what can be held with 32-bit indices. Solving this requires additional work, namely storing the indices as 64-bit integers and then modifying the kernels to account for the change. Changes in performance and VRAM usage are to be expected. It might be a good idea to have two pathways (small vs. large number of rows).
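As an illustration of the 64-bit index idea, here is a minimal CPU-side sketch of COO symmetrization with 64-bit indices (pure NumPy; the duplicate-merge rule below is a simple max rather than UMAP's fuzzy union, and the actual RAFT kernels are GPU code that differs in detail):

import numpy as np

def symmetrize_coo_int64(rows, cols, vals, n):
    # Store indices as int64 so that nnz and the flattened (row * n + col)
    # keys can exceed the int32 range without overflowing.
    rows = rows.astype(np.int64)
    cols = cols.astype(np.int64)
    # Concatenate the graph with its transpose, then merge duplicate edges.
    r = np.concatenate([rows, cols])
    c = np.concatenate([cols, rows])
    v = np.concatenate([vals, vals])
    keys = r * np.int64(n) + c  # would overflow int32 for large n
    order = np.argsort(keys, kind="stable")
    keys, r, c, v = keys[order], r[order], c[order], v[order]
    _, start = np.unique(keys, return_index=True)  # segment starts
    # Max-merge duplicate edges (illustrative stand-in for the fuzzy union).
    return r[start], c[start], np.maximum.reduceat(v, start)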

@pablete commented Jan 16, 2025

Great! Does nightly 25.01 or nightly 25.02 have the fix?

@viclafargue (Contributor) commented

The fix for the integer overflows in the Lanczos solver has been merged for the 25.02 release. Other PRs should follow soon: rapidsai/raft#2541 and #6245.

@jcrist (Member) commented Jan 24, 2025

I just built #6245 on top of rapidsai/raft#2541 and am still seeing an OOM running the above test script. This was on an 80 GB A100.

Error Log
$ CUDA_VISIBLE_DEVICES=3 python perf.py
Finished generating data
Starting UMAP
Traceback (most recent call last):
  File "/raid/jcristharif/cuml/perf.py", line 33, in <module>
    do_umap(synthetic_data, n_components=32, n_clusters=10)
  File "/raid/jcristharif/cuml/perf.py", line 17, in do_umap
    embeddings = reducer.fit_transform(data, data_on_host=data_on_host)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "base.pyx", line 788, in cuml.internals.base.UniversalBase.dispatch_func
  File "umap.pyx", line 756, in cuml.manifold.umap.UMAP.fit_transform
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "base.pyx", line 788, in cuml.internals.base.UniversalBase.dispatch_func
  File "umap.pyx", line 693, in cuml.manifold.umap.UMAP.fit
MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/raid/jcristharif/miniforge3/envs/cuml-test/include/rmm/mr/device/limiting_resource_adaptor.hpp:152: Exceeded memory limit

@jcrist (Member) commented Jan 27, 2025

With managed memory disabled (the set_current_device_resource line commented out) and n_clusters=10, the test case above completed fine on my machine. This was on an 80 GB A100; the script took around 35 minutes to run end to end.
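For reference, the workaround amounts to running the reproducer above without the managed-memory resource and keeping the data on host (a condensed, illustrative sketch using the parameter values quoted in this thread):

import numpy as np
# rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())  # disabled
from cuml.manifold import UMAP

rng = np.random.default_rng(seed=12)
data = rng.random((int(45e6), 768), dtype="float32")

reducer = UMAP(
    n_components=32,
    build_algo="nn_descent",
    build_kwds={"nnd_n_clusters": 10},
)
# Keep the input on host and let batched nn-descent stream it to the GPU.
embeddings = reducer.fit_transform(data, data_on_host=True)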
