Add correlation parameter for KNN performance tests #330

benchaplin · 2025-01-17T22:48:15Z

We've been investigating implementing the "ACORN" filtered KNN search algorithm in Lucene. One aspect of filtered search which luceneutil can't test is "correlation" of the filter (see Figure 2 in the ACORN paper).

So I've implemented a filterCorrelation parameter to do just that. It works by generating a unique filter for each query vector. filterSelectivity is preserved exactly, and the given filterCorrelation is used to dial "how much" correlation there is.

Here's an idea of how it works:

For filterCorrelation < 0, it starts by setting bits in the bit set for the filterSelectivity fraction of documents with the lowest scores against the query.
For filterCorrelation > 0, it instead sets bits for the filterSelectivity fraction of documents with the highest scores.

Then it takes a (1 - |filterCorrelation|) portion of the set bits and swaps each one with a clear bit drawn randomly from the entire document set, so the number of set bits does not change.
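
The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from this PR: the function name `correlated_filter`, the use of a Python set in place of a bit set, and the choice to clear randomly picked set bits and then refill from all currently clear documents are assumptions made for the example.

```python
import random


def correlated_filter(scores, selectivity, correlation, rng):
    """Return the set of doc ids that pass a per-query filter.

    scores[i] is the similarity of doc i to the query (higher = closer).
    Hypothetical helper for illustration only.
    """
    n = len(scores)
    k = round(selectivity * n)  # number of docs that must pass

    # Doc ids ordered from lowest to highest score.
    order = sorted(range(n), key=lambda doc: scores[doc])

    # Positive correlation starts from the k best docs, otherwise the k worst.
    # At correlation == 0 the starting point is irrelevant because every
    # set bit is redrawn below.
    start = order[n - k:] if correlation > 0 else order[:k]
    passing = set(start)

    # Number of set bits to move: a (1 - |correlation|) portion of k.
    n_swap = round((1.0 - abs(correlation)) * k)

    # Clear n_swap randomly chosen set bits ...
    cleared = rng.sample(sorted(passing), n_swap)
    passing.difference_update(cleared)

    # ... and set the same number of bits drawn from all clear docs,
    # so exactly k docs pass again.
    candidates = [doc for doc in range(n) if doc not in passing]
    passing.update(rng.sample(candidates, n_swap))
    return passing


if __name__ == "__main__":
    rng = random.Random(42)
    scores = [rng.random() for _ in range(1000)]
    for c in (-1.0, -0.5, 0.0, 0.5, 1.0):
        f = correlated_filter(scores, 0.1, c, rng)
        mean = sum(scores[d] for d in f) / len(f)
        print(f"correlation={c:+.1f} passing={len(f)} mean score={mean:.3f}")
```

In this sketch a correlation of 1.0 or -1.0 swaps nothing, while 0.0 redraws every set bit, which gives a uniformly random filter. The number of passing docs is always round(selectivity * n).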


I landed on this approach because it provides a smooth-ish dial for correlation where -1 is the worst possible, 0 is totally random, and 1 is the best possible. It also works over any set of scores, no matter the distribution. It is a bit slow because it creates a filter over every query vector and calculates ndoc scores for each. On my machine it takes ~45s to generate the filters for 100,000 docs & 1,000 queries.

benchaplin · 2025-01-27T23:11:24Z

Here are some runs with Cohere data.

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  visited  selectivity  correlation   filterType
 1.000         0.700  100000   100      50       16        100     1000         0.01        -1.00  post-filter
 1.000         0.768  100000   100      50       16        100     1000         0.01        -0.50  post-filter
 1.000         0.561  100000   100      50       16        100     1018         0.01         0.00  post-filter
 1.000         0.743  100000   100      50       16        100     1000         0.01         0.50  post-filter
 1.000         0.663  100000   100      50       16        100      999         0.01         1.00  post-filter

 1.000         3.629  100000   100      50       16        100     5000         0.05        -1.00  post-filter
 1.000         4.215  100000   100      50       16        100     5000         0.05        -0.50  post-filter
 1.000         3.616  100000   100      50       16        100     4980         0.05         0.00  post-filter
 0.906         1.411  100000   100      50       16        100     3867         0.05         0.50  post-filter
 0.868         0.783  100000   100      50       16        100     2296         0.05         1.00  post-filter

 1.000         7.349  100000   100      50       16        100    10000         0.10        -1.00  post-filter
 1.000         7.731  100000   100      50       16        100    10000         0.10        -0.50  post-filter
 0.998         7.153  100000   100      50       16        100     9917         0.10         0.00  post-filter
 0.897         1.328  100000   100      50       16        100     3844         0.10         0.50  post-filter
 0.868         0.751  100000   100      50       16        100     2296         0.10         1.00  post-filter

 1.000        17.036  100000   100      50       16        100    25000         0.25        -1.00  post-filter
 0.936         3.656  100000   100      50       16        100     9719         0.25        -0.50  post-filter
 0.923         2.212  100000   100      50       16        100     6611         0.25         0.00  post-filter
 0.896         1.427  100000   100      50       16        100     3698         0.25         0.50  post-filter
 0.868         0.733  100000   100      50       16        100     2296         0.25         1.00  post-filter

 1.000        31.823  100000   100      50       16        100    50000         0.50        -1.00  post-filter
 0.908         1.594  100000   100      50       16        100     4733         0.50        -0.50  post-filter
 0.897         1.287  100000   100      50       16        100     3947         0.50         0.00  post-filter
 0.891         1.103  100000   100      50       16        100     3381         0.50         0.50  post-filter
 0.868         0.809  100000   100      50       16        100     2296         0.50         1.00  post-filter

 0.979        42.900  100000   100      50       16        100    73127         0.75        -1.00  post-filter
 0.882         0.948  100000   100      50       16        100     2789         0.75        -0.50  post-filter
 0.882         0.922  100000   100      50       16        100     2885         0.75         0.00  post-filter
 0.883         0.958  100000   100      50       16        100     2896         0.75         0.50  post-filter
 0.868         0.746  100000   100      50       16        100     2296         0.75         1.00  post-filter

 1.000         0.721  100000   100      50       16        100     1000         0.01        -1.00  pre-filter
 1.000         0.831  100000   100      50       16        100     1000         0.01        -0.50  pre-filter
 1.000         0.532  100000   100      50       16        100     1006         0.01         0.00  pre-filter
 1.000         0.814  100000   100      50       16        100     1000         0.01         0.50  pre-filter
 1.000         0.624  100000   100      50       16        100      999         0.01         1.00  pre-filter

 1.000         3.945  100000   100      50       16        100     5000         0.05        -1.00  pre-filter
 1.000         4.165  100000   100      50       16        100     5000         0.05        -0.50  pre-filter
 1.000         3.370  100000   100      50       16        100     4998         0.05         0.00  pre-filter
 0.908         1.704  100000   100      50       16        100     3871         0.05         0.50  pre-filter
 0.868         0.840  100000   100      50       16        100     2296         0.05         1.00  pre-filter

 1.000         7.543  100000   100      50       16        100    10000         0.10        -1.00  pre-filter
 1.000         7.723  100000   100      50       16        100    10000         0.10        -0.50  pre-filter
 0.998         7.386  100000   100      50       16        100    10076         0.10         0.00  pre-filter
 0.897         1.262  100000   100      50       16        100     3855         0.10         0.50  pre-filter
 0.868         0.756  100000   100      50       16        100     2296         0.10         1.00  pre-filter

 1.000        16.707  100000   100      50       16        100    25000         0.25        -1.00  pre-filter
 0.935         3.577  100000   100      50       16        100     9725         0.25        -0.50  pre-filter
 0.923         2.860  100000   100      50       16        100     6695         0.25         0.00  pre-filter
 0.897         1.222  100000   100      50       16        100     3703         0.25         0.50  pre-filter
 0.868         0.749  100000   100      50       16        100     2296         0.25         1.00  pre-filter

 1.000        32.584  100000   100      50       16        100    50000         0.50        -1.00  pre-filter
 0.907         1.665  100000   100      50       16        100     4728         0.50        -0.50  pre-filter
 0.898         1.261  100000   100      50       16        100     3916         0.50         0.00  pre-filter
 0.891         1.119  100000   100      50       16        100     3385         0.50         0.50  pre-filter
 0.868         0.792  100000   100      50       16        100     2296         0.50         1.00  pre-filter

 0.979        43.833  100000   100      50       16        100    73127         0.75        -1.00  pre-filter
 0.881         0.905  100000   100      50       16        100     2794         0.75        -0.50  pre-filter
 0.881         0.924  100000   100      50       16        100     2869         0.75         0.00  pre-filter
 0.883         1.181  100000   100      50       16        100     2888         0.75         0.50  pre-filter
 0.868         0.743  100000   100      50       16        100     2296         0.75         1.00  pre-filter

msokolov · 2025-01-29T13:53:06Z

I'm confused by this work (sorry, haven't read the Acorn paper). What is meant by correlation? Is it that filtered-out documents are more likely to be near each other when the filter is highly correlated?

benchaplin · 2025-01-29T15:03:31Z

@msokolov This is all relative to the query vector. So a "highly correlated filter" is one where the filter-passing vectors are close to the query vector.
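
One rough way to make that definition concrete is to compare how the filter-passing docs score against the query with how the whole document set scores. The sketch below is only an illustration: `mean_score_gap` is a hypothetical diagnostic (not something luceneutil reports), and dot product is assumed as the similarity.

```python
def dot(a, b):
    """Dot-product similarity (an assumption; any vector score would do)."""
    return sum(x * y for x, y in zip(a, b))


def mean_score_gap(query, docs, passing):
    """Mean score of filter-passing docs minus mean score of all docs.

    A positive gap means the passing docs sit closer to the query than
    average (positively correlated filter); a negative gap means they sit
    farther away; a gap near zero suggests an uncorrelated filter.
    """
    scores = [dot(query, d) for d in docs]
    overall = sum(scores) / len(scores)
    inside = sum(scores[i] for i in passing) / len(passing)
    return inside - overall
```

For instance, with the query (1, 0) and the four unit vectors along the axes as docs, a filter that passes only (1, 0) gives a gap of 1.0, and a filter that passes only (-1, 0) gives -1.0.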

msokolov · 2025-01-29T15:56:52Z

I see, thanks, that makes sense! (that's a common use case, but sometimes they can definitely also be unrelated)

benchaplin · 2025-01-31T18:39:16Z

Actually @msokolov, the question you asked is making me think: filtered vectors being clustered themselves is probably the best simulation of real-world "correlation." Then, of course, it matters how far the query vector is from this cluster.

In my approach, a filterCorrelation close to 1.0 will result in a cluster close to the query. That's great. But the lower filterCorrelation is, the less clustered the filter is - the chosen vectors are just the ones with the lowest score against the query.

A better approach might be: generate a clustered filter, then run query vectors against it and determine a range of "correlation results" based on the closeness of each query vector to that cluster.

e.g. Generate an arbitrary clustered filter.

score(queryVec1, cluster) is high - running this query gives "positively correlated" results
score(queryVec2, cluster) is avg - running this query gives "zero correlated" results
score(queryVec3, cluster) is low - running this query gives "negatively correlated" results
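
A minimal sketch of that idea, under stated assumptions: the cluster is built by taking the docs nearest to an arbitrarily chosen seed doc, score(query, cluster) is taken to be the dot product between the query and that seed vector, and queries are split into three equal-sized groups. The names `clustered_filter` and `bucket_queries` are hypothetical and none of this exists in the PR.

```python
def dot(a, b):
    """Dot-product similarity (an assumption; any vector score would do)."""
    return sum(x * y for x, y in zip(a, b))


def clustered_filter(docs, selectivity, seed_doc):
    """Pass the round(selectivity * n) docs most similar to docs[seed_doc].

    Returns the set of passing doc ids and the seed vector, which stands in
    for the cluster when scoring queries.
    """
    n = len(docs)
    k = round(selectivity * n)
    centre = docs[seed_doc]
    ranked = sorted(range(n), key=lambda d: dot(docs[d], centre), reverse=True)
    return set(ranked[:k]), centre


def bucket_queries(queries, centre):
    """Group query ids by how close each query is to the cluster centre."""
    ranked = sorted(range(len(queries)), key=lambda q: dot(queries[q], centre))
    third = len(ranked) // 3
    return {
        "negative": ranked[:third],                    # lowest scores
        "zero": ranked[third:len(ranked) - third],     # middle scores
        "positive": ranked[len(ranked) - third:],      # highest scores
    }
```

Here a single filter is shared by all queries, so the per-query filter generation cost goes away, and the reported correlation bucket is derived from the query's position relative to the cluster rather than being set as an input parameter.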

Hope that makes sense. I'll try to play around with this idea.

benchaplin added 9 commits January 14, 2025 15:28

Implement per-query filters for max/min correlation

9933741

Add dynamic correlation generation alg

054319e

Add tests, define iter/batch constants depending on n

3ed53ac

Merge branch 'main' into knn_correlation

c314e24

Fix correlation summary log

e1314ea

Add correlation to python report

67a7c38

Remove randomness, shift filter as a block

8a70e93

Adjust approach

4148a8d

Remove tests, throw exception for too big ndoc param

de8d74c

benchaplin marked this pull request as ready for review January 27, 2025 23:11

Use getAndClear

df188f6

benchaplin mentioned this pull request Jan 31, 2025

Add new Acorn-esque filtered HNSW search heuristic apache/lucene#14160

Open
