
[FEA] Implement UMAP transform with batched nn descent #6215

Open
btepera opened this issue Jan 10, 2025 · 2 comments
Labels: ? - Needs Triage, feature request

Comments


btepera commented Jan 10, 2025

Currently, if I want to run UMAP with batched nn descent, I can call fit_transform() and this works. However, if I want to call fit and transform independently (e.g. to fit on just a subset of the overall dataset and then transform the rest), only fit currently supports batched nn descent, while transform falls back to brute force knn.

import numpy as np
from cuml.manifold import UMAP

N = 10000  # number of samples
K = 32     # feature dimensionality

rng = np.random.default_rng()
data = rng.random((N, K), dtype="float32")

reducer = UMAP(
    n_components=2,
    n_neighbors=15,
    build_algo="nn_descent",           # build the knn graph with NN-descent
    build_kwds={"nnd_n_clusters": 4},  # batch the graph build across 4 clusters
)

# fit() uses batched nn descent, but transform() falls back to brute force:
fitted_umap = reducer.fit(data)
embeddings = fitted_umap.transform(data)

[I] [13:36:50.855150] Transform can only be run with brute force. Using brute force.

How much effort would be required for nn descent to support transform as well?

cjnolet (Member) commented Jan 10, 2025

@btepera, one thing that makes this really challenging is that (even the CPU-based) UMAP requires the original training data to be kept around in order to figure out where the points belong during out-of-sample inference. Naturally if an approximate nearest neighbors index is used, we could just store that off for fast lookup, but that often still requires storing the original training vectors, unfortunately.
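For reference, this is visible in the CPU implementation too; a quick sketch with umap-learn (the _raw_data attribute name is from memory, so double-check it):

import numpy as np
import umap  # umap-learn, the reference CPU implementation

X = np.random.default_rng(0).random((500, 16)).astype(np.float32)
reducer = umap.UMAP(n_neighbors=15).fit(X)

# The fitted estimator keeps the original training matrix around so that
# transform() has something to search against for out-of-sample points.
print(reducer._raw_data.shape)  # (500, 16)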

The other challenge is that our nn-descent implementation currently only supports constructing a knn graph (we call it an all-neighbors graph) on a single set of input vectors, and doesn't yet support constructing one from, say, a set of "index" vectors and a disjoint set of "lookup" vectors. This is not impossible to do, but UMAP ultimately requires that the "transform" perform a lookup for the closest training vectors for each of the "transform" vectors. It's just something we would need to add to cuVS in order to make this possible.
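To make the distinction concrete, here's a brute-force NumPy sketch of the two operations (the helper and variable names are mine, purely for illustration):

import numpy as np

def knn(index_vecs, query_vecs, k):
    # brute-force k-NN: for each query vector, the indices of its k nearest
    # neighbors within index_vecs (squared Euclidean distance, unordered)
    d2 = ((query_vecs[:, None, :] - index_vecs[None, :, :]) ** 2).sum(-1)
    return np.argpartition(d2, k, axis=1)[:, :k]

rng = np.random.default_rng(0)
train = rng.random((500, 32), dtype="float32")
queries = rng.random((100, 32), dtype="float32")

# What nn-descent builds today: an all-neighbors graph, i.e. neighbors of
# the training set within itself (each point matches itself, too).
all_neighbors = knn(train, train, 15)      # shape (500, 15)

# What transform() needs: neighbors of a disjoint query set looked up in
# the training set -- the index/query split that is still missing.
transform_graph = knn(train, queries, 15)  # shape (100, 15), indices into train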

Question: if the feature I just mentioned above were to be written, would you be okay with having the original training vectors stored in the umap estimator in order to do the transform?

cjnolet (Member) commented Jan 14, 2025

I was thinking through this a little further and it dawned on me that we can take the knn graph from the batched nn descent and run it through CAGRA's optimize() function, so that we can store the CAGRA index on the umap estimator. Unfortunately, it doesn't remove the need to keep the raw vectors around, but it would work out of the box today for doing transform() with an ANN index. This is also similar to what the reference UMAP implementation does.
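Roughly the flow this would enable, sketched with the cuVS Python CAGRA bindings (a sketch of the idea, not the actual cuML internals; API names per recent cuVS docs and parameter choices arbitrary):

import cupy as cp
from cuvs.neighbors import cagra

train = cp.random.random((10000, 32), dtype=cp.float32)
new_points = cp.random.random((1000, 32), dtype=cp.float32)

# fit time: build a CAGRA index over the training vectors (per the proposal
# above, the knn graph from batched nn descent would be optimize()d into this
# index rather than rebuilt from scratch) and store it on the estimator
index = cagra.build(cagra.IndexParams(graph_degree=32), train)

# transform time: ANN lookup of the out-of-sample points in the training set
distances, neighbors = cagra.search(cagra.SearchParams(), index, new_points, 15)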
