
[FEA] Implement UMAP transform with batched nn descent #6215

Open
btepera opened this issue Jan 10, 2025 · 2 comments
Labels: ? - Needs Triage, feature request

Comments


btepera commented Jan 10, 2025

Currently, if I want to run UMAP with batched nn descent, I can call fit_transform() and this works. However, if I want to call fit and transform independently (e.g. to fit on just a subset of the overall dataset and then transform the rest), only fit currently supports batched nn descent, while transform falls back to brute force knn.

import numpy as np
from cuml.manifold import UMAP

N = 10000  # number of samples
K = 32     # feature dimensionality

rng = np.random.default_rng()
data = rng.random((N, K), dtype="float32")

reducer = UMAP(
    n_components=2,
    n_neighbors=15,
    build_algo="nn_descent",           # build the knn graph with NN-descent
    build_kwds={"nnd_n_clusters": 4},  # batch the graph build across 4 clusters
)

# fit() uses batched nn descent, but transform() falls back to brute force:
fitted_umap = reducer.fit(data)
embeddings = fitted_umap.transform(data)

[I] [13:36:50.855150] Transform can only be run with brute force. Using brute force.

How much effort would be required for nn descent to support transform as well?

cjnolet (Member) commented Jan 10, 2025

@btepera, one thing that makes this really challenging is that (even the CPU-based) UMAP requires the original training data to be kept around in order to figure out where the points belong during out-of-sample inference. Naturally if an approximate nearest neighbors index is used, we could just store that off for fast lookup, but that often still requires storing the original training vectors, unfortunately.
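For reference, this is visible in the CPU implementation too; a quick sketch with umap-learn (the _raw_data attribute name is from memory, so double-check it):

import numpy as np
import umap  # umap-learn, the reference CPU implementation

X = np.random.default_rng(0).random((500, 16)).astype(np.float32)
reducer = umap.UMAP(n_neighbors=15).fit(X)

# The fitted estimator keeps the original training matrix around so that
# transform() has something to search against for out-of-sample points.
print(reducer._raw_data.shape)  # (500, 16)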

The other challenge is that our nn-descent implementation currently only supports constructing a knn graph (we call it an all-neighbors graph) on a single set of input vectors, and doesn't yet support constructing one from, say, a set of "index" vectors and a disjoint set of "lookup" vectors. This is not impossible to do, but UMAP ultimately requires that the "transform" perform a lookup for the closest training vectors for each of the "transform" vectors. It's just something we would need to add to cuVS in order to make this possible.
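To make the distinction concrete, here's a brute-force NumPy sketch of the two operations (the helper and variable names are mine, purely for illustration):

import numpy as np

def knn(index_vecs, query_vecs, k):
    # brute-force k-NN: for each query vector, the indices of its k nearest
    # neighbors within index_vecs (squared Euclidean distance, unordered)
    d2 = ((query_vecs[:, None, :] - index_vecs[None, :, :]) ** 2).sum(-1)
    return np.argpartition(d2, k, axis=1)[:, :k]

rng = np.random.default_rng(0)
train = rng.random((500, 32), dtype="float32")
queries = rng.random((100, 32), dtype="float32")

# What nn-descent builds today: an all-neighbors graph, i.e. neighbors of
# the training set within itself (each point matches itself, too).
all_neighbors = knn(train, train, 15)      # shape (500, 15)

# What transform() needs: neighbors of a disjoint query set looked up in
# the training set -- the index/query split that is still missing.
transform_graph = knn(train, queries, 15)  # shape (100, 15), indices into train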

Question: if the feature I just mentioned above were to be written, would you be okay with having the original training vectors stored in the umap estimator in order to do the transform?

cjnolet (Member) commented Jan 14, 2025

I was thinking through this a little further and it dawned on me that we can take the knn graph from the batched nn descent and run it through CAGRA's optimize() function, so that we can store the CAGRA index on the umap estimator. Unfortunately, it doesn't remove the need to keep the raw vectors around, but it would work out of the box today for doing transform() with an ANN index. This is also similar to what the reference UMAP implementation does.
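Roughly the flow this would enable, sketched with the cuVS Python CAGRA bindings (a sketch of the idea, not the actual cuML internals; API names per recent cuVS docs and parameter choices arbitrary):

import cupy as cp
from cuvs.neighbors import cagra

train = cp.random.random((10000, 32), dtype=cp.float32)
new_points = cp.random.random((1000, 32), dtype=cp.float32)

# fit time: build a CAGRA index over the training vectors (per the proposal
# above, the knn graph from batched nn descent would be optimize()d into this
# index rather than rebuilt from scratch) and store it on the estimator
index = cagra.build(cagra.IndexParams(graph_degree=32), train)

# transform time: ANN lookup of the out-of-sample points in the training set
distances, neighbors = cagra.search(cagra.SearchParams(), index, new_points, 15)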
