Add correlation parameter for KNN performance tests #330
Here are some runs with cohere data.
I'm confused by this work (sorry, haven't read the Acorn paper). What is meant by correlation? Is it that filtered-out documents are more likely to be near each other when the filter is highly correlated?
@msokolov This is all relative to the query vector. So a "highly correlated filter" is one where the filter-passing vectors are close to the query vector.
I see, thanks, that makes sense! (that's a common use case, but sometimes they can definitely also be unrelated)
Actually @msokolov, the question you asked is making me think: filtered vectors being clustered themselves is probably the best simulation of real-world "correlation." Then, of course, it matters how far the query vector is from this cluster. In my approach, a ... A better approach might be: generate a clustered filter, then run query vectors against it and determine a range of "correlation results" based on the closeness of the query vector to that cluster. e.g. Generate an arbitrary clustered filter.
Hope that makes sense. I'll try to play around with this idea.
We've been investigating implementing the "ACORN" filtered KNN search algorithm in Lucene. One aspect of filtered search which luceneutil can't test is "correlation" of the filter (see Figure 2 in the ACORN paper).
So I've implemented a `filterCorrelation` parameter to do just that. It works by generating a unique filter for each query vector. `filterSelectivity` is preserved exactly, and the given `filterCorrelation` is used to dial "how much" correlation there is.

Here's an idea of how it works:

- If `filterCorrelation < 0`, it starts by setting the `filterSelectivity` fraction of the lowest scores to 1 in the bit set.
- If `filterCorrelation > 0`, it sets the `filterSelectivity` fraction of the highest scores.

Then it takes a portion of the set bits (1 - |correlation|) and flips them with clear bits drawn randomly from the entire document set.
For example:
I landed on this approach because it provides a smooth-ish dial for correlation where -1 is the worst possible, 0 is totally random, and 1 is the best possible. It also works over any set of scores, no matter the distribution. It is a bit slow because it creates a filter over every query vector and calculates `ndoc` scores for each. On my machine it takes ~45s to generate the filters for 100,000 docs & 1,000 queries.
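
For readers who want to see the shape of the filter construction described above, here is a minimal Python sketch. It is not the code in this PR: the function name `correlated_filter`, its arguments, and the rounding choices are hypothetical, and the swap step is one interpretation of "flips them with clear bits drawn randomly from the entire document set" (each chosen set bit is cleared, then a random document that is currently clear is set instead). The behavior at exactly `filterCorrelation == 0` is also an assumption; here it starts from the lowest scores and then swaps every set bit.

```python
import random


def correlated_filter(scores, selectivity, correlation, rng=None):
    """Build one per-query filter as a list of 0/1 bits (hypothetical helper).

    scores      -- similarity of every document to a single query vector
    selectivity -- fraction of documents that should pass the filter
    correlation -- dial in [-1, 1]; the sign picks which end of the ranking to start from
    """
    rng = rng or random.Random()
    n = len(scores)
    num_set = round(n * selectivity)

    # Rank documents by score. Positive correlation starts from the highest
    # scores; negative (and, by assumption, zero) starts from the lowest.
    order = sorted(range(n), key=lambda i: scores[i], reverse=(correlation > 0))
    initially_set = order[:num_set]

    bits = [0] * n
    for i in initially_set:
        bits[i] = 1

    # Move a (1 - |correlation|) share of the set bits to randomly drawn
    # clear positions. The number of set bits never changes, so the
    # selectivity is preserved exactly.
    num_swap = round(num_set * (1 - abs(correlation)))
    for i in rng.sample(initially_set, num_swap):
        bits[i] = 0
        while True:
            j = rng.randrange(n)  # drawn from the entire document set
            if bits[j] == 0:      # may land on i itself or on any other clear bit
                bits[j] = 1
                break
    return bits
```

Calling this once per query vector, with `scores` holding that query's similarity to all `ndoc` documents, mirrors the per-query cost mentioned above: every query needs a full pass of scoring plus a sort before its filter can be built.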