[BUG] Batched nn-descent UMAP unexpectedly throws OOM error on dataset that should succeed with UVM #6204

Open
beckernick opened this issue Jan 3, 2025 · 6 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working)

@beckernick (Member) commented Jan 3, 2025

Batched UMAP with nn-descent enables processing much larger datasets than before on a single GPU.

Some users want to process 100+ GB datasets on a single GPU, which can sometimes still overwhelm GPU memory depending on the UMAP parameters.

Managed memory should be a potential path to enabling these larger workloads.

When I enable RMM Managed Memory for a workload that should be just too big to fit in GPU memory on a machine with 2 TB of CPU RAM and an 80GB H100 GPU, I unexpectedly get an OOM error.

I was surprised to see this: watching top, CPU memory usage never goes anywhere near 2 TB (it peaks at roughly 300 GB).

Batched nn-descent UMAP uses multiple CPU threads, but this error still occurs if OMP_NUM_THREADS=1 is set before execution.
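The OMP_NUM_THREADS=1 check was done by setting the variable in the environment before launching the script; an approximately equivalent approach (an assumption on my part, not part of the original reproducer below) is to set it from Python before importing cuml:

import os

# Assumption: OpenMP reads OMP_NUM_THREADS when its runtime initializes,
# so setting it here, before cuml/raft are imported and used, should have
# the same effect as exporting it in the shell.
os.environ["OMP_NUM_THREADS"] = "1"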

import gc

import numpy as np
import rmm

# Route all RMM device allocations through CUDA managed (unified) memory,
# which can oversubscribe GPU memory and migrate pages to/from host RAM.
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

from cuml.manifold import UMAP

def do_umap(data, n_components=2, n_clusters=4, data_on_host=True):
    reducer = UMAP(
        n_components=n_components,
        build_algo="nn_descent",
        build_kwds={"nnd_n_clusters": n_clusters},
    )
    # Fit and transform the data
    embeddings = reducer.fit_transform(data, data_on_host=data_on_host)
    del embeddings
    del reducer
    gc.collect()

# 40e6 x 768, UMAP n_components=32, n_clusters=10 succeeds on the H100 80GB GPU without UVM
# 45e6 x 768, UMAP n_components=32, n_clusters=10 cannot succeed with or without UVM
N = int(45e6)
K = 768

rng = np.random.default_rng(seed=12)
synthetic_data = rng.random((N, K), dtype="float32")

print("Finished generating data")

print("Starting UMAP")
do_umap(synthetic_data, n_components=32, n_clusters=10)
print("Done")
$ CUDA_VISIBLE_DEVICES=7 python umap-uvm-oom-reproducer.py

Finished generating data
Starting UMAP

Traceback (most recent call last):
  File "/raid/nicholasb/umap-uvm-oom-reproducer.py", line 34, in <module>
    do_umap(synthetic_data, n_components=32, n_clusters=10)
  File "/raid/nicholasb/umap-uvm-oom-reproducer.py", line 18, in do_umap
    embeddings = reducer.fit_transform(data, data_on_host=data_on_host)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 188, in
 wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 393, in
 dispatch
return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "base.pyx", line 720, in cuml.internals.base.UniversalBase.dispatch_func
  File "umap.pyx", line 741, in cuml.manifold.umap.UMAP.fit_transform
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/nicholasb/miniforge3/envs/bertopic/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "base.pyx", line 720, in cuml.internals.base.UniversalBase.dispatch_func
  File "umap.pyx", line 678, in cuml.manifold.umap.UMAP.fit
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /tmp/pip-build-env-m_qegmhe/normal/lib/python3.11/site-packages/librmm/include/rmm/mr/device/managed_memory_resource.hpp:66: cudaErrorMemoryAllocation out of memory

This reproduces with both the stable cuML 24.12 and nightly 25.02.

@beckernick added the "? - Needs Triage" and "bug" labels on Jan 3, 2025
@pablete commented Jan 3, 2025

This happened to me as well, on an EC2 p4d.24xlarge instance (1.1 TB of RAM, A100 with 40 GB of VRAM).
I am using vectors of dimension 1280. I think roughly 45 million rows is where it breaks:

Works!
N = int(40e6)
K = 1280

OOM!
N = int(45e6)
K = 1280

@viclafargue (Contributor) commented

It looks like there are two issues that occur at different thresholds (different numbers of rows). The first is a series of integer overflows in the Lanczos solver; rapidsai/raft#2536 should solve it. The second appears in RAFT's sparse matrix utilities (COO symmetrization): the required nnz is larger than what can be held with 32-bit indices. Solving this requires additional work, namely storing the indices as 64-bit integers and then modifying the kernels to account for the change. Changes in performance and VRAM usage are to be expected. It might be a good idea to have two pathways (small vs. large number of rows).
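As an illustration of the 64-bit index idea, here is a minimal CPU-side sketch of COO symmetrization with 64-bit indices (pure NumPy; the duplicate-merge rule below is a simple max rather than UMAP's fuzzy union, and the actual RAFT kernels are GPU code that differs in detail):

import numpy as np

def symmetrize_coo_int64(rows, cols, vals, n):
    # Store indices as int64 so that nnz and the flattened (row * n + col)
    # keys can exceed the int32 range without overflowing.
    rows = rows.astype(np.int64)
    cols = cols.astype(np.int64)
    # Concatenate the graph with its transpose, then merge duplicate edges.
    r = np.concatenate([rows, cols])
    c = np.concatenate([cols, rows])
    v = np.concatenate([vals, vals])
    keys = r * np.int64(n) + c  # would overflow int32 for large n
    order = np.argsort(keys, kind="stable")
    keys, r, c, v = keys[order], r[order], c[order], v[order]
    _, start = np.unique(keys, return_index=True)  # segment starts
    # Max-merge duplicate edges (illustrative stand-in for the fuzzy union).
    return r[start], c[start], np.maximum.reduceat(v, start)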

@pablete commented Jan 16, 2025

Great! Does nightly 25.01 or nightly 25.02 have the fix?

@viclafargue (Contributor) commented

The fix for the integer overflows in the Lanczos solver has been merged for the 25.02 release. Other PRs should follow soon: rapidsai/raft#2541 and #6245.

@jcrist (Member) commented Jan 24, 2025

I just built #6245 on top of rapidsai/raft#2541 and am still seeing an OOM running the above test script. This was on an 80 GB A100.

Error Log
$ CUDA_VISIBLE_DEVICES=3 python perf.py
Finished generating data
Starting UMAP
Traceback (most recent call last):
  File "/raid/jcristharif/cuml/perf.py", line 33, in <module>
    do_umap(synthetic_data, n_components=32, n_clusters=10)
  File "/raid/jcristharif/cuml/perf.py", line 17, in do_umap
    embeddings = reducer.fit_transform(data, data_on_host=data_on_host)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "base.pyx", line 788, in cuml.internals.base.UniversalBase.dispatch_func
  File "umap.pyx", line 756, in cuml.manifold.umap.UMAP.fit_transform
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/jcristharif/miniforge3/envs/cuml-test/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "base.pyx", line 788, in cuml.internals.base.UniversalBase.dispatch_func
  File "umap.pyx", line 693, in cuml.manifold.umap.UMAP.fit
MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/raid/jcristharif/miniforge3/envs/cuml-test/include/rmm/mr/device/limiting_resource_adaptor.hpp:152: Exceeded memory limit

@jcrist (Member) commented Jan 27, 2025

With managed memory disabled (the set_current_device_resource line commented out) and n_clusters=10, the test case above completed fine on my machine. This was on an 80 GB A100; the script took around 35 minutes to run end to end.
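For reference, the workaround amounts to running the reproducer above without the managed-memory resource and keeping the data on host (a condensed, illustrative sketch using the parameter values quoted in this thread):

import numpy as np
# rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())  # disabled
from cuml.manifold import UMAP

rng = np.random.default_rng(seed=12)
data = rng.random((int(45e6), 768), dtype="float32")

reducer = UMAP(
    n_components=32,
    build_algo="nn_descent",
    build_kwds={"nnd_n_clusters": 10},
)
# Keep the input on host and let batched nn-descent stream it to the GPU.
embeddings = reducer.fit_transform(data, data_on_host=True)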
