Add correlation parameter for KNN performance tests #330

Open
benchaplin wants to merge 10 commits into base: main
@benchaplin benchaplin commented Jan 17, 2025

We've been investigating implementing the "ACORN" filtered KNN search algorithm in Lucene. One aspect of filtered search which luceneutil can't test is "correlation" of the filter (see Figure 2 in the ACORN paper).

So I've implemented a filterCorrelation parameter to do just that. It works by generating a unique filter for each query vector. filterSelectivity is preserved exactly, and the given filterCorrelation is used to dial "how much" correlation there is.

Here's an idea of how it works:

  • For filterCorrelation < 0, it starts by setting the filterSelectivity fraction of the lowest scores to 1 in the bit set
  • For filterCorrelation > 0 it sets the filterSelectivity fraction of the highest scores

Then it takes a fraction (1 - |correlation|) of the set bits and swaps each one with a clear bit drawn at random from the entire document set.
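A minimal Python sketch of the procedure described above, not the code in this PR. The function name `correlated_filter`, its parameters, and the use of a plain list of booleans as the bit set are assumptions made for illustration. The description does not say which end of the score range filterCorrelation = 0 starts from, so the sketch starts from the lowest scores; in that case every set bit is swapped anyway.

```python
import random


def correlated_filter(scores, selectivity, correlation, rng=None):
    """Build one filter (list of booleans) for a single query vector.

    scores      -- score of every document against the query (hypothetical input)
    selectivity -- fraction of documents that pass the filter
    correlation -- value in [-1, 1]; see the description above
    """
    rng = rng or random.Random()
    n = len(scores)
    k = round(selectivity * n)  # number of set bits, kept constant throughout

    # Start from the k highest scores (correlation > 0) or the k lowest (otherwise).
    order = sorted(range(n), key=lambda d: scores[d], reverse=(correlation > 0))
    chosen = order[:k]
    bits = [False] * n
    for d in chosen:
        bits[d] = True

    # Swap a (1 - |correlation|) fraction of the set bits with random clear bits.
    n_swap = round((1 - abs(correlation)) * k)
    if k < n:  # with no clear bits there is nothing to swap with
        for d in rng.sample(chosen, n_swap):
            # Draw from the entire document set until a clear bit is found.
            c = rng.randrange(n)
            while bits[c]:
                c = rng.randrange(n)
            bits[d] = False
            bits[c] = True
    return bits
```

Because each swap clears one bit and sets another, the number of set bits never changes, which is how the selectivity stays exact in this sketch.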

For example:
[image: Screenshot 2025-01-27 at 5 29 57 PM]

I landed on this approach because it provides a smooth-ish dial for correlation where -1 is the worst possible, 0 is totally random, and 1 is the best possible. It also works over any set of scores, no matter the distribution. It is a bit slow because it creates a filter over every query vector and calculates ndoc scores for each. On my machine it takes ~45s to generate the filters for 100,000 docs & 1,000 queries.

@benchaplin
Author

Here are some runs with cohere data.

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  visited  selectivity  correlation   filterType
 1.000         0.700  100000   100      50       16        100     1000         0.01        -1.00  post-filter
 1.000         0.768  100000   100      50       16        100     1000         0.01        -0.50  post-filter
 1.000         0.561  100000   100      50       16        100     1018         0.01         0.00  post-filter
 1.000         0.743  100000   100      50       16        100     1000         0.01         0.50  post-filter
 1.000         0.663  100000   100      50       16        100      999         0.01         1.00  post-filter

 1.000         3.629  100000   100      50       16        100     5000         0.05        -1.00  post-filter
 1.000         4.215  100000   100      50       16        100     5000         0.05        -0.50  post-filter
 1.000         3.616  100000   100      50       16        100     4980         0.05         0.00  post-filter
 0.906         1.411  100000   100      50       16        100     3867         0.05         0.50  post-filter
 0.868         0.783  100000   100      50       16        100     2296         0.05         1.00  post-filter

 1.000         7.349  100000   100      50       16        100    10000         0.10        -1.00  post-filter
 1.000         7.731  100000   100      50       16        100    10000         0.10        -0.50  post-filter
 0.998         7.153  100000   100      50       16        100     9917         0.10         0.00  post-filter
 0.897         1.328  100000   100      50       16        100     3844         0.10         0.50  post-filter
 0.868         0.751  100000   100      50       16        100     2296         0.10         1.00  post-filter

 1.000        17.036  100000   100      50       16        100    25000         0.25        -1.00  post-filter
 0.936         3.656  100000   100      50       16        100     9719         0.25        -0.50  post-filter
 0.923         2.212  100000   100      50       16        100     6611         0.25         0.00  post-filter
 0.896         1.427  100000   100      50       16        100     3698         0.25         0.50  post-filter
 0.868         0.733  100000   100      50       16        100     2296         0.25         1.00  post-filter

 1.000        31.823  100000   100      50       16        100    50000         0.50        -1.00  post-filter
 0.908         1.594  100000   100      50       16        100     4733         0.50        -0.50  post-filter
 0.897         1.287  100000   100      50       16        100     3947         0.50         0.00  post-filter
 0.891         1.103  100000   100      50       16        100     3381         0.50         0.50  post-filter
 0.868         0.809  100000   100      50       16        100     2296         0.50         1.00  post-filter

 0.979        42.900  100000   100      50       16        100    73127         0.75        -1.00  post-filter
 0.882         0.948  100000   100      50       16        100     2789         0.75        -0.50  post-filter
 0.882         0.922  100000   100      50       16        100     2885         0.75         0.00  post-filter
 0.883         0.958  100000   100      50       16        100     2896         0.75         0.50  post-filter
 0.868         0.746  100000   100      50       16        100     2296         0.75         1.00  post-filter

 1.000         0.721  100000   100      50       16        100     1000         0.01        -1.00  pre-filter
 1.000         0.831  100000   100      50       16        100     1000         0.01        -0.50  pre-filter
 1.000         0.532  100000   100      50       16        100     1006         0.01         0.00  pre-filter
 1.000         0.814  100000   100      50       16        100     1000         0.01         0.50  pre-filter
 1.000         0.624  100000   100      50       16        100      999         0.01         1.00  pre-filter

 1.000         3.945  100000   100      50       16        100     5000         0.05        -1.00  pre-filter
 1.000         4.165  100000   100      50       16        100     5000         0.05        -0.50  pre-filter
 1.000         3.370  100000   100      50       16        100     4998         0.05         0.00  pre-filter
 0.908         1.704  100000   100      50       16        100     3871         0.05         0.50  pre-filter
 0.868         0.840  100000   100      50       16        100     2296         0.05         1.00  pre-filter

 1.000         7.543  100000   100      50       16        100    10000         0.10        -1.00  pre-filter
 1.000         7.723  100000   100      50       16        100    10000         0.10        -0.50  pre-filter
 0.998         7.386  100000   100      50       16        100    10076         0.10         0.00  pre-filter
 0.897         1.262  100000   100      50       16        100     3855         0.10         0.50  pre-filter
 0.868         0.756  100000   100      50       16        100     2296         0.10         1.00  pre-filter

 1.000        16.707  100000   100      50       16        100    25000         0.25        -1.00  pre-filter
 0.935         3.577  100000   100      50       16        100     9725         0.25        -0.50  pre-filter
 0.923         2.860  100000   100      50       16        100     6695         0.25         0.00  pre-filter
 0.897         1.222  100000   100      50       16        100     3703         0.25         0.50  pre-filter
 0.868         0.749  100000   100      50       16        100     2296         0.25         1.00  pre-filter

 1.000        32.584  100000   100      50       16        100    50000         0.50        -1.00  pre-filter
 0.907         1.665  100000   100      50       16        100     4728         0.50        -0.50  pre-filter
 0.898         1.261  100000   100      50       16        100     3916         0.50         0.00  pre-filter
 0.891         1.119  100000   100      50       16        100     3385         0.50         0.50  pre-filter
 0.868         0.792  100000   100      50       16        100     2296         0.50         1.00  pre-filter

 0.979        43.833  100000   100      50       16        100    73127         0.75        -1.00  pre-filter
 0.881         0.905  100000   100      50       16        100     2794         0.75        -0.50  pre-filter
 0.881         0.924  100000   100      50       16        100     2869         0.75         0.00  pre-filter
 0.883         1.181  100000   100      50       16        100     2888         0.75         0.50  pre-filter
 0.868         0.743  100000   100      50       16        100     2296         0.75         1.00  pre-filter

@benchaplin benchaplin marked this pull request as ready for review January 27, 2025 23:11
@msokolov
Collaborator

I'm confused by this work (sorry, haven't read the Acorn paper). What is meant by correlation? Is it that filtered-out documents are more likely to be near each other when the filter is highly correlated?

@benchaplin
Author

@msokolov This is all relative to the query vector. So a "highly correlated filter" is one where the filter-passing vectors are close to the query vector.

@msokolov
Collaborator

I see, thanks, that makes sense! (that's a common use case, but sometimes they can definitely also be unrelated)

@benchaplin
Author

Actually @msokolov, the question you asked is making me think: filtered vectors being clustered themselves is probably the best simulation of real-world "correlation." Then, of course, it matters how far the query vector is from this cluster.

In my approach, a filterCorrelation close to 1.0 will result in a cluster close to the query. That's great. But the lower filterCorrelation is, the less clustered the filter is - the chosen vectors are just the ones with the lowest score against the query.

A better approach might be: generate a clustered filter, then run query vectors against it and determine a range of "correlation results" based on the closeness of the query vector to that cluster.

e.g. Generate an arbitrary clustered filter.

  • score(queryVec1, cluster) is high - running this query gives "positively correlated" results
  • score(queryVec2, cluster) is avg - running this query gives "zero correlated" results
  • score(queryVec3, cluster) is low - running this query gives "negatively correlated" results
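One way this idea could look in Python, purely as a sketch under assumptions. Nothing here comes from the PR: `clustered_filter` and `cluster_score` are hypothetical names, the cluster is assumed to be the documents most similar to a randomly chosen document, and `score(queryVec, cluster)` is assumed to be the dot product of the query with the cluster centroid.

```python
import random


def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def clustered_filter(doc_vectors, selectivity, rng=None):
    """Return the ids of a cluster of documents around a random centre document."""
    rng = rng or random.Random()
    n = len(doc_vectors)
    k = max(1, round(selectivity * n))
    centre = doc_vectors[rng.randrange(n)]
    # Assumption: "clustered" means the k documents most similar to the centre.
    ranked = sorted(range(n), key=lambda d: dot(doc_vectors[d], centre), reverse=True)
    return set(ranked[:k])


def cluster_score(query, doc_vectors, members):
    """Assumed closeness measure: similarity of the query to the cluster centroid."""
    dim = len(query)
    centroid = [sum(doc_vectors[d][i] for d in members) / len(members) for i in range(dim)]
    return dot(query, centroid)
```

Queries could then be bucketed by `cluster_score` into high, average, and low groups, matching the three cases listed above.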

Hope that makes sense. I'll try to play around with this idea.
