Add new Acorn-esque filtered HNSW search heuristic #14160

Open
wants to merge 20 commits into base: main
Conversation

benwtrent
Member

This is a continuation and completion of the work started by @benchaplin in #14085

The algorithm is fairly simple:

  • Only score and then explore vectors that actually match the filtering criteria
  • Since this will make the graph even sparser, the search spread is increased to also include the candidates' neighbors' neighbors (generally maxConn * maxConn exploration)
  • Additionally, more scored candidates per NSW layer are considered, to combat the increased sparsity

Some of the changes to the baseline Acorn algorithm are:

  • There is some general threshold of filtering that bypasses this algorithm altogether. Early benchmarking seems to indicate that this might be around 50%, but honestly, it's not fully convincing...
  • The number of additional neighbors explored is predicated on the percentage of the immediate neighborhood that is filtered out
  • Only look at the extended neighbors if less than 90% of the current neighborhood matches the filter.
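The heuristic described in the bullets above can be sketched as a small standalone example. This is plain Java, not Lucene's actual implementation: the adjacency-array graph representation, the `expand` helper, and the exact placement of the 90% check are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustrative sketch of the filtered-expansion heuristic (not Lucene's code).
public class FilteredExpansion {

  // node: the candidate being expanded; adjacency: the layer's neighbor lists;
  // filter: the accept-ords predicate. Returns the ords worth scoring.
  static List<Integer> expand(int node, int[][] adjacency, IntPredicate filter) {
    List<Integer> toScore = new ArrayList<>();
    int[] neighbors = adjacency[node];
    int matched = 0;
    for (int n : neighbors) {
      if (filter.test(n)) { // only score vectors that match the filter
        toScore.add(n);
        matched++;
      }
    }
    // Only look at the extended neighborhood when less than 90% of the
    // immediate neighborhood matches the filter.
    if (neighbors.length > 0 && matched < 0.9 * neighbors.length) {
      for (int n : neighbors) {
        for (int nn : adjacency[n]) { // neighbors' neighbors (two hops out)
          if (nn != node && filter.test(nn) && !toScore.contains(nn)) {
            toScore.add(nn);
          }
        }
      }
    }
    return toScore;
  }
}
```

With maxConn neighbors per node, the two-hop pass can visit up to maxConn * maxConn ords, which is the increased search spread mentioned above.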

Here are some numbers for 1M vectors, float32 and then int4 quantized.

https://docs.google.com/spreadsheets/d/1GqD7Jw42IIqimr2nB78fzEfOohrcBlJzOlpt0NuUVDQ/edit?gid=163290867#gid=163290867

Something I am unsure about:

  • How should this setting be exposed to users? While I am not a fan of more configuration at query time, the behavior seems different enough to justify it.

TODO:

  • More manual testing over more datasets
  • Add some unit and functional tests.

closes: #13940

@benwtrent benwtrent added this to the 10.2.0 milestone Jan 22, 2025
RandomVectorScorer scorer,
KnnCollector knnCollector,
HnswGraph graph,
int filterSize,
Contributor

@benchaplin benchaplin Jan 23, 2025


FYI: missed a javadoc @param for filterSize

* exceeded
* @throws IOException When accessing the vector fails
*/
private int findBestEntryPoint(RandomVectorScorer scorer, KnnCollector collector)
Contributor


Thoughts on abstracting this since it's identical (minus source of graph) to HnswGraphSearcher::findBestEntryPoint? I suppose we might want to investigate multiple entry points in the future so maybe the duplicate code will be gone soon.

Member Author


I don't mind having two pieces of code vs. the wrong abstraction; we can refactor in a separate PR later if we wish. If only two places are copy-pasting the code, it's probably OK.

@benwtrent
Member Author

@msokolov I wonder what your opinion is here?

Do you think the behavior change/result change is worth waiting for a major? I do think folks should be able to use this now, but be able to opt out.

Another option I could think of is injecting a parameter or something directly into SPI loading for the hnsw vector readers. But I am not 100% sure how to do that. It does seem like it should be something that is a "global" configuration for a given Lucene instance instead of one that is provided at query time.

}
if (acceptOrds.get(friendOfAFriendOrd)) {
toScore.add(friendOfAFriendOrd);
}
Contributor

I was expecting else { toExplore.add(friendOfAFriendOrd) } here for >2-hop exploration. Did you drop that idea after performance testing?

Member Author


@benchaplin yeah, going further than 2 hops didn't seem to improve anything in testing. We can adjust it later if needed.

int friendOrd;
while ((friendOrd = graph.nextNeighbor()) != NO_MORE_DOCS && toScore.isFull() == false) {
assert friendOrd < size : "friendOrd=" + friendOrd + "; size=" + size;
if (visited.get(friendOrd) || explorationVisited.getAndSet(friendOrd)) {
Contributor

What's the purpose of explorationVisited?

Member Author


We need to keep the visited set for 2-hop exploration candidates separate from the one for immediate neighborhoods. If we already did an expanded search from a candidate, we don't want to explore it again during expanded search.

However, if that candidate matches the filter, we DON'T want to skip it during regular search.

Maybe we could collapse the two visited sets into one, but keeping two seemed OK to me.
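The two-set bookkeeping described here can be illustrated with a small standalone sketch. The class and method names are hypothetical, and the logic is simplified from the PR's actual code:

```java
import java.util.BitSet;

// Sketch of the two visited sets (simplified; not the PR's actual classes).
// 'visited' tracks ords seen by the regular search; 'explorationVisited'
// tracks ords already seen while expanding extended (2-hop) neighborhoods.
public class TwoVisitedSets {
  final BitSet visited = new BitSet();
  final BitSet explorationVisited = new BitSet();

  // Returns true if ord should be skipped during extended exploration:
  // skip anything already scored by the main search or already explored.
  boolean skipDuringExploration(int ord) {
    if (visited.get(ord) || explorationVisited.get(ord)) {
      return true;
    }
    explorationVisited.set(ord);
    return false;
  }

  // Returns true if ord should be skipped during regular search.
  // An ord seen only during exploration is NOT skipped here, so a
  // matching candidate still gets its regular-search treatment.
  boolean skipDuringRegularSearch(int ord) {
    if (visited.get(ord)) {
      return true;
    }
    visited.set(ord);
    return false;
  }
}
```

This shows why one merged set would be lossy: marking an ord during exploration would wrongly make the regular search skip it.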

Contributor


Ah, yes, that makes sense. So for immediate neighborhoods, wouldn't we just want to do:

if (visited.getAndSet(friendOrd)) {
  continue;
}

?

@benwtrent
Member Author

I ran this over the "nightly" dataset (8M 768-dim vectors), with no force merging; I believe this matches the nightly benchmark setup. I ran over various filter criteria (I think nightly uses 5%).

BASELINE

recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
 1.000       110.216  8000000   100      50    79846        0.010
 0.982       137.185  8000000   100      50   215393        0.050
 0.974        85.933  8000000   100      50   144953        0.100
 0.965        73.476  8000000   100      50    86333        0.200
 0.958        58.347  8000000   100      50    64055        0.300
 0.952        34.021  8000000   100      50    51634        0.400
 0.944        32.818  8000000   100      50    43643        0.500
 0.940        29.538  8000000   100      50    38200        0.600
 0.936        26.965  8000000   100      50    34205        0.700
 0.930        25.453  8000000   100      50    30989        0.800
 0.926        23.585  8000000   100      50    28482        0.900
 0.924        23.926  8000000   100      50    27318        0.950
 0.922        23.306  8000000   100      50    26481        0.990
CANDIDATE

recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
 0.640        28.972  8000000   100      50    10709        0.010
 0.855        34.103  8000000   100      50    20845        0.050
 0.908        37.990  8000000   100      50    36339        0.100
 0.922        47.513  8000000   100      50    54472        0.200
 0.903        46.094  8000000   100      50    56451        0.300
 0.894        41.164  8000000   100      50    52235        0.400
 0.870        30.850  8000000   100      50    36989        0.500
 0.881        28.043  8000000   100      50    34102        0.600
 0.896        27.725  8000000   100      50    33346        0.700
 0.904        25.472  8000000   100      50    31135        0.800
 0.913        23.670  8000000   100      50    26715        0.900
 0.918        23.148  8000000   100      50    26193        0.950
 0.922        22.982  8000000   100      50    26425        0.990

The goal is generally "higher recall with fewer visited nodes". A nice single value to capture this is recall/visited: as visited decreases or recall increases, that value goes up, so higher is better.

I graphed this ratio (multiplying by 100,000 to make the values look saner):
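As a concrete instance of the metric, here is the calculation for the 0.010 selectivity rows quoted in the two tables above (plain Java, just arithmetic on the numbers already posted in this comment):

```java
// recall/visited metric (scaled by 100,000), using the 0.010 selectivity
// rows from the BASELINE and candidate tables above.
public class RecallVisitedRatio {

  static double ratio(double recall, long visited) {
    return recall / visited * 100_000;
  }

  public static void main(String[] args) {
    double baseline = ratio(1.000, 79846);  // baseline row, selectivity 0.010 (≈ 1.252)
    double candidate = ratio(0.640, 10709); // candidate row, selectivity 0.010 (≈ 5.976)
    System.out.printf("baseline=%.3f candidate=%.3f%n", baseline, candidate);
  }
}
```

Despite the lower raw recall at this selectivity, the candidate visits far fewer nodes, so its recall/visited value comes out several times higher, consistent with the improvement described below.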

[image: graph of recall/visited (×100,000), baseline vs. candidate]

So, this shows that on nightly the ratio is significantly improved, by as much as 5x.

I am currently force merging and will attempt to re-run.

Here is some more data for candidate only at 0.05 filtering with increasing fanout:

recall  latency (ms)     nDoc  topK  fanout  visited  selectivity
 0.855        29.257  8000000   100      50    20845        0.050
 0.859        30.215  8000000   100      60    21514        0.050
 0.862        31.189  8000000   100      70    22134        0.050
 0.866        31.998  8000000   100      80    22718        0.050
 0.868        32.896  8000000   100      90    23294        0.050
 0.871        33.569  8000000   100     100    23877        0.050
 0.873        29.677  8000000   100     110    24447        0.050
 0.875        34.983  8000000   100     120    24978        0.050
 0.877        34.644  8000000   100     130    25494        0.050
 0.879        36.034  8000000   100     140    26015        0.050
 0.881        36.557  8000000   100     150    26533        0.050
 0.883        36.708  8000000   100     160    27034        0.050
 0.884        36.946  8000000   100     170    27534        0.050
 0.886        38.691  8000000   100     180    27999        0.050
 0.888        39.257  8000000   100     190    28503        0.050
 0.890        39.152  8000000   100     200    28955        0.050
 0.891        40.726  8000000   100     210    29453        0.050
 0.892        41.062  8000000   100     220    29895        0.050
 0.893        40.994  8000000   100     230    30319        0.050
 0.895        41.713  8000000   100     240    30736        0.050
 0.896        42.321  8000000   100     250    31180        0.050

Linked issue: Look into ACORN-1, or another algorithm to aid in filtered HNSW search