Currently, if I want to run UMAP with batched NN Descent, I can call fit_transform() and this works. However, if I want to call fit and transform independently (e.g. to fit on just a subset of the overall dataset), only fit currently supports batched NN Descent, while transform falls back to brute-force knn. How much effort would be required for NN Descent to support transform as well?
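For concreteness, here is a minimal sketch of the situation, assuming cuML's UMAP estimator with `build_algo="nn_descent"`; the `build_kwds` key `nnd_n_clusters` as the batching knob is an assumption based on the NN Descent build options:

```python
import cupy as cp
from cuml.manifold import UMAP

X = cp.random.random((100_000, 64), dtype=cp.float32)

umap = UMAP(
    n_neighbors=15,
    build_algo="nn_descent",           # build the knn graph with NN Descent
    build_kwds={"nnd_n_clusters": 4},  # assumed key: >1 clusters => batched build
)

# Works today: the whole pipeline runs through batched NN Descent.
emb_all = umap.fit_transform(X)

# Also works: fit on a subset of the data...
umap.fit(X[:50_000])
# ...but transform() falls back to brute-force knn rather than an
# NN Descent / ANN-based lookup against the training set.
emb_rest = umap.transform(X[50_000:])
```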
@btepera, one thing that makes this really challenging is that UMAP (even the CPU-based implementation) requires the original training data to be kept around in order to figure out where new points belong during out-of-sample inference. Naturally, if an approximate nearest-neighbors index is used, we could store that off for fast lookup, but that often still requires storing the original training vectors, unfortunately.
The other challenge is that our nn-descent implementation currently only supports constructing a knn graph (we call it an all-neighbors graph) over a single set of input vectors; it doesn't yet support constructing one from, say, a set of "index" vectors and a disjoint set of "lookup" vectors. This is not impossible to do, but UMAP ultimately requires that transform look up the closest training vectors for each of the transform vectors, and that's something we would need to add to cuVS in order to make this possible.
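To make the gap concrete: what transform ultimately needs is a k-nearest-training-vectors lookup for each transform vector, which is exactly the index/lookup split described above. A hedged sketch of those semantics, assuming cuVS's Python brute-force bindings (this is roughly what the current fallback does; the feature request is the same split for NN Descent):

```python
import cupy as cp
from cuvs.neighbors import brute_force

train = cp.random.random((50_000, 64), dtype=cp.float32)   # "index" vectors
queries = cp.random.random((1_000, 64), dtype=cp.float32)  # "lookup" vectors

# Build over the training set only, then search with a disjoint set of
# lookup vectors: for each query, the k nearest *training* vectors.
index = brute_force.build(train)
distances, neighbors = brute_force.search(index, queries, k=15)
```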
Question: if the feature I just mentioned above were to be written, would you be okay with having the original training vectors stored in the UMAP estimator in order to do the transform?
I was thinking through this a little further and it dawned on me that we can take the knn graph from the batched NN Descent run and pass it through CAGRA's optimize() function, so that we can store the CAGRA index on the UMAP estimator. Unfortunately, this doesn't remove the need to keep the raw vectors around, but it would work today, out of the box, for doing transform() with an ANN index. It's also similar to what the reference UMAP does.
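A rough sketch of that idea, assuming cuVS's Python CAGRA bindings. Note that cagra.build() runs the knn-graph construction and graph optimization internally, which approximates the knn-graph-then-optimize() flow described above (the standalone optimize() lives in the C++ API):

```python
import cupy as cp
from cuvs.neighbors import cagra

train = cp.random.random((50_000, 64), dtype=cp.float32)

# fit(): build the CAGRA index (knn-graph construction + graph optimization
# happen inside build()) and stash it, plus the raw vectors, on the estimator.
index = cagra.build(cagra.IndexParams(graph_degree=32), train)

# transform(): ANN lookup of each new point's nearest training vectors.
queries = cp.random.random((1_000, 64), dtype=cp.float32)
distances, neighbors = cagra.search(cagra.SearchParams(), index, queries, k=15)
```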