Bypass HNSW graph building for tiny segments #14963

shubhamvishu · 2025-07-17T16:51:47Z

Description

This change avoids creating an HNSW graph if the segment is small (for now the threshold is 10000 vectors, based on the conversation here).

Some points I'm not sure how we want to handle:

All the tests pass currently since the option to enable the optimization is false by default, but setting it to true reveals some failing unit tests that inherently assume the HNSW graph is created and KNN search is triggered (do we have some idea of how to bypass those in a clean way?)
I understand we might want to always keep this optimization on (also a less invasive change), but for now in this PR I made it configurable and enabled it on the KNN format, just to be cautious (I wasn't sure whether it would affect back-compat in some unknown way). Happy to make it the default behaviour.

TODOs:

Add specific unit tests
Benchmarks (luceneutil)

Closes #13447

benwtrent

Some minor ideas.

benwtrent · 2025-07-17T17:21:14Z

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java

+   * When enabled, segments with fewer than the threshold number of vectors will store only flat
+   * vectors, significantly improving indexing performance for workloads with frequent flushes.
+   */
+  private final boolean bypassTinySegments;


If we allow this to be a parameter, it should be a threshold that refers to the typical k used when querying.

Makes sense

benwtrent · 2025-07-17T17:21:51Z

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java

+    boolean doHnsw =
+        knnCollector.k() < scorer.maxOrd()
+            && (bypassTinySegments == false
+                || fieldEntry.size() > Lucene99HnswVectorsFormat.HNSW_GRAPH_THRESHOLD);


The reader should just look to see if there is a graph.

benwtrent · 2025-07-17T17:24:24Z

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java

+      // Check if we need to initialize graph builder for tiny segment optimization
+      if (bypassTinySegments
+          && graphBuilderInitialized == false
+          && node >= Lucene99HnswVectorsFormat.HNSW_GRAPH_THRESHOLD) {


I am thinking we should just use the expectedNodeVisited logic for the given node count (treating it like it's a "graph") vs. the tiny segment threshold number (which should be used as a k).

I like the idea, I'll change the predicate to use that logic

benwtrent · 2025-07-17T17:26:16Z

I think 10K is likely way too large. 10k vector ops vs. 10 * log(10k) ops is a huge difference.

If a user typically searches for 10 nearest neighbors, the graph should be built at around 90 vectors.
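
To make the arithmetic in the comment above concrete, here is a minimal sketch that compares a brute-force scan (one vector comparison per vector) with a rough graph-search estimate of k * ln(n) comparisons. The class and method names are hypothetical, and the natural-log cost model is an assumption chosen to mirror the "10 * log(10k)" expression above; it is not a measurement and not the PR's code.

```java
/** Hypothetical cost model for illustration; not taken from the PR. */
final class TinySegmentCostSketch {

  /** Assumed graph-search cost: roughly k * ln(n) vector comparisons. */
  static int expectedVisitedNodes(int k, int graphSize) {
    return (int) (Math.log(graphSize) * k);
  }

  /**
   * Assumed predicate: build a graph only when the segment holds more than k vectors
   * and a full scan would cost more than the graph-search estimate.
   */
  static boolean shouldBuildGraph(int k, int numVectors) {
    return numVectors > k && numVectors > expectedVisitedNodes(k, numVectors);
  }

  public static void main(String[] args) {
    int k = 10;
    for (int n : new int[] {10, 100, 1_000, 10_000}) {
      System.out.printf(
          "n=%d scan=%d graphEstimate=%d buildGraph=%b%n",
          n, n, expectedVisitedNodes(k, n), shouldBuildGraph(k, n));
    }
  }
}
```

Under this model, with k = 10 and n = 10,000 the estimate is 92 comparisons against 10,000 for a full scan, which is the gap the comment points at. Where exactly the crossover sits depends on the cost model, so the predicate here is only one possible reading of the suggestion.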

msokolov · 2025-07-17T20:02:33Z

I can think of a circumstance where we might create small segments that will probably never get searched at all, but will very quickly be merged. In that case we might want to allow a larger threshold?

msokolov · 2025-07-17T20:05:19Z

As far as the tests are concerned I'm confused, wouldn't they fail with a high threshold since we wouldn't build a graph until there are many documents? Maybe I didn't understand the meaning of the threshold though.

shubhamvishu · 2025-07-17T20:28:43Z

@msokolov Actually they do. I had missed setting bypassTinySegments=true locally in one of the constructors, so the tests didn't exercise that path. Setting it to true does reveal the failing unit tests.

I'll see if there is a clean way to override those checks in the failing unit tests.

jpountz · 2025-07-17T20:52:49Z

I wonder how this interacts with how AbstractKnnVectorQuery does pre-filtering by first passing the filter to KnnVectorsReader#search, and then falling back to an exact search. If the segment doesn't have a HNSW graph, this may effectively start an exact search (via KnnVectorsReader#search) and then abort it to do an exact search again? Or am I missing something?

benwtrent · 2025-07-18T11:31:19Z

> I can think of a circumstance where we might create small segments that will probably never get searched at all, but will very quickly be merged. In that case we might want to allow a larger threshold?

I think that is fine. I am thinking of semi-nrt with lots of updates. In cases like that 10k is way too big a default. I think the value should be used as an input to expectedVisitedNodes that takes into account the potential graph size.

Additionally, I would assume users would want to scale quantized formats vs. non-quantized differently (as their vector ops can be much cheaper than floating point ops).

> and then abort it to do an exact search again? Or am I missing something?

I would hope the format just does the right thing, and searches everything, knowing that there isn't a graph.

shubhamvishu · 2025-07-18T11:45:35Z

@jpountz Ahh, I see what you are pointing at. Here is what I think we could try:

We already fall back to exact search once the visitedLimit is breached in HNSW search, so the same visited limit would apply when we iterate over the docs linearly. Net-net, approximateKnn (visit V nodes) + exactSearch ~= exactSearch (visit V nodes linearly) + exactSearch, which might not impact search time much. So one option is to accept this cost, since we will visit a small number of docs, but I agree we can optimize this path further (more on this in the points below).
We could completely remove the fallback to exactSearch in AbstractKnnVectorQuery by relaxing the check from
- if (knnCollector.earlyTerminated()) to
- if (knnCollector instanceof TimeLimitingKnnCollectorManager.TimeLimitingKnnCollector && ((TimeLimitingKnnCollectorManager.TimeLimitingKnnCollector)knnCollector).shouldExit()) after making TimeLimitingKnnCollector public and exposing shouldExit()
This would ensure we continue the exact search in VectorsReader and don't fall back to exactSearch in AbstractKnnVectorQuery (we can maybe do better, more on that below).
[PROPOSED] That said, I think AbstractKnnVectorQuery#exactSearch is the better place for exact search, since it uses a conjunctive DocIdSetIterator rather than iterating over all the docs? If so, we could simply add an else if condition in VectorsReader that immediately overwhelms the collector (forcing its earlyTerminated to return true) and returns, so the query automatically falls back to the best exactSearch implementation (I hope that gives us the best of both worlds?)

    else if (getGraph(fieldEntry).equals(HnswGraph.EMPTY)) {
      // Falls back to exactSearch directly
      knnCollector.incVisitedCount((int) knnCollector.visitLimit() + 1);
    }

Let me know your thoughts or if I'm missing something here. Thanks!
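
The "overwhelm the collector" idea above can be tried out on a toy model. The sketch below is a stand-in, not Lucene's KnnCollector: it assumes the collector reports early termination once its visited count reaches the visit limit, and every name in it is hypothetical. It also shows one detail worth checking in the snippet above: if the limit can be Integer.MAX_VALUE, the expression `(int) limit + 1` overflows to a negative number and the collector does not terminate.

```java
/** Toy stand-in for a kNN collector with a visit budget; not Lucene's KnnCollector. */
final class ToyVisitLimitCollector {
  private final long visitLimit;
  private long visitedCount;

  ToyVisitLimitCollector(long visitLimit) {
    this.visitLimit = visitLimit;
  }

  long visitLimit() {
    return visitLimit;
  }

  void incVisitedCount(int count) {
    visitedCount += count;
  }

  /** Assumed rule: the search counts as terminated once the budget is used up. */
  boolean earlyTerminated() {
    return visitedCount >= visitLimit;
  }

  /**
   * Uses up the budget in one call. Clamping avoids int overflow; this assumes the
   * limit itself fits in an int, otherwise a single call is not enough.
   */
  static void overwhelm(ToyVisitLimitCollector collector) {
    collector.incVisitedCount((int) Math.min(Integer.MAX_VALUE, collector.visitLimit()));
  }

  public static void main(String[] args) {
    ToyVisitLimitCollector naive = new ToyVisitLimitCollector(Integer.MAX_VALUE);
    naive.incVisitedCount((int) naive.visitLimit() + 1); // overflows to a negative count
    System.out.println("naive terminated: " + naive.earlyTerminated());

    ToyVisitLimitCollector clamped = new ToyVisitLimitCollector(Integer.MAX_VALUE);
    overwhelm(clamped);
    System.out.println("clamped terminated: " + clamped.earlyTerminated());
  }
}
```

Whether the real collector can ever be handed a limit that large depends on the caller, so treat the overflow case as something to verify rather than a confirmed bug.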

jpountz · 2025-07-18T17:50:25Z

My recommendation would be to move the logic of switching to an exact search when the filter is selective to KnnVectorsReader#search (in a separate PR) so that the file format can make the right decision depending on whether it only has a flat index or something more sophisticated such as a HNSW index. (It doesn't feel completely straightforward since KnnVectorsReader#search may not know how to pull an efficient iterator that matches the same docs as the Bits acceptDocs).

benwtrent · 2025-07-19T12:32:25Z

This makes me wonder if the knn search method should accept a ScorerSupplier and the live docs Bits instead of a fully realized bit set that represents both the filter and live docs....

jpountz · 2025-07-19T20:57:49Z

Or some higher-level abstraction that can either be consumed in a random-access fashion (Bits) or sequential (DocIdSetIterator)?

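import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;

abstract class AcceptDocs {

  /** Random access to the accepted documents. */
  abstract Bits getBits();

  /** Get an iterator of accepted docs. */
  abstract DocIdSetIterator getIterator();

  /** Return an approximation of the number of accepted documents. */
  abstract long cost();
}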
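
As a rough illustration of how such an abstraction could serve both access patterns, here is a self-contained toy version. It uses java.util.BitSet and PrimitiveIterator.OfInt as stand-ins for Lucene's Bits and DocIdSetIterator, and the class name is hypothetical; it only sketches the shape of the idea, not a proposed implementation.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;

/** Toy version of the proposed abstraction, using java.util types as stand-ins. */
final class BitSetAcceptDocs {
  private final BitSet accepted;

  BitSetAcceptDocs(BitSet accepted) {
    // Defensive copy so later changes to the caller's BitSet do not leak in.
    this.accepted = (BitSet) accepted.clone();
  }

  /** Random access: is this doc accepted? Suits graph traversal. */
  boolean get(int doc) {
    return accepted.get(doc);
  }

  /** Sequential access: accepted doc IDs in increasing order. Suits an exact scan. */
  PrimitiveIterator.OfInt iterator() {
    return accepted.stream().iterator();
  }

  /** Number of accepted docs (exact here; the proposal only asks for an approximation). */
  long cost() {
    return accepted.cardinality();
  }

  public static void main(String[] args) {
    BitSet bits = new BitSet();
    bits.set(2);
    bits.set(5);
    BitSetAcceptDocs docs = new BitSetAcceptDocs(bits);
    System.out.println("doc 5 accepted: " + docs.get(5) + ", cost: " + docs.cost());
    docs.iterator().forEachRemaining((int doc) -> System.out.println("scan doc " + doc));
  }
}
```

A reader with only flat vectors could then walk `iterator()` directly, while a graph-based reader could keep using random access, with `cost()` helping either one decide how selective the filter is.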

vigyasharma

How do we handle backward compatibility in this change? I noticed we don't write any metadata (e.g. in FieldEntry) about bypassTinySegments or whether a graph was built or not. The flag gets configured when the format is initialized from the codec.

What happens if I create an index with bypassTinySegments=true, but later read it in an application with the flag set to false? I think we need to persist information about whether graph was built for the segment.

vigyasharma · 2025-07-20T07:19:11Z

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java

+      this.bypassTinySegments = bypassTinySegments;
+      this.flatFieldVectorsWriter = Objects.requireNonNull(flatFieldVectorsWriter);
+      if (bypassTinySegments) {
+        this.bufferedVectors = new ArrayList<>();


Since we only store up to HNSW_GRAPH_THRESHOLD vectors, beyond which we resume the regular flow of adding them to the graph, could we use an array here instead of an ArrayList?

vigyasharma · 2025-07-20T07:21:08Z

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java

+        replayBufferedVectors();
+        bufferedVectors.clear();
+      }
+      if (hnswGraphBuilder != null) {


Does hnswGraphBuilder != null do the same thing as graphBuilderInitialized? If so, do we need graphBuilderInitialized?
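
The two review suggestions above (a fixed-size buffer, and a null check on the builder instead of a separate flag) can be sketched together on a toy writer. Everything here is illustrative: the class is hypothetical, a plain list stands in for the graph builder, and the threshold handling is an assumption modeled on the `node >= HNSW_GRAPH_THRESHOLD` check in the diff.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of deferred graph building; not the PR's writer. */
final class DeferredGraphWriterSketch {
  private final int threshold;
  private final float[][] buffered; // fixed-size buffer, holds at most `threshold` vectors
  private int bufferedCount;
  private List<float[]> graphNodes; // stand-in for the graph builder; null until needed

  DeferredGraphWriterSketch(int threshold) {
    this.threshold = threshold;
    this.buffered = new float[threshold][];
  }

  void addValue(float[] vector) {
    if (graphNodes == null) {
      if (bufferedCount < threshold) {
        buffered[bufferedCount++] = vector; // still tiny: just buffer
        return;
      }
      // Threshold crossed: create the "builder" and replay the buffered vectors in order.
      graphNodes = new ArrayList<>();
      for (int i = 0; i < bufferedCount; i++) {
        graphNodes.add(buffered[i]);
        buffered[i] = null; // release the reference
      }
      bufferedCount = 0;
    }
    graphNodes.add(vector);
  }

  /** The null check is the single source of truth; no separate boolean flag. */
  boolean hasGraph() {
    return graphNodes != null;
  }

  int graphSize() {
    return graphNodes == null ? 0 : graphNodes.size();
  }
}
```

With a threshold of 3, the first three vectors are only buffered, and the fourth triggers the replay so the stand-in graph ends up with all four.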

vigyasharma · 2025-07-20T20:28:00Z

> I think we need to persist information about whether graph was built for the segment.

Maybe we could use one of the existing fields that describe the graph. Like set numLevels=0 when there is no graph (otherwise it would at least be 1)?
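
A minimal sketch of that idea, assuming a toy metadata record rather than Lucene's actual FieldEntry or on-disk layout: the writer stores numLevels = 0 when it skipped the graph, and the reader decides between graph search and a flat scan purely from what was persisted, independent of how the format was configured at read time. All names are hypothetical.

```java
import java.nio.ByteBuffer;

/** Toy field metadata; the layout is illustrative, not Lucene's on-disk format. */
final class ToyFieldEntry {
  final int size; // number of vectors in the segment
  final int numLevels; // 0 means "no graph was written"

  ToyFieldEntry(int size, int numLevels) {
    this.size = size;
    this.numLevels = numLevels;
  }

  /** Serializes the two fields as big-endian ints. */
  byte[] write() {
    return ByteBuffer.allocate(2 * Integer.BYTES).putInt(size).putInt(numLevels).array();
  }

  static ToyFieldEntry read(byte[] data) {
    ByteBuffer in = ByteBuffer.wrap(data);
    int size = in.getInt();
    int numLevels = in.getInt();
    return new ToyFieldEntry(size, numLevels);
  }

  boolean hasGraph() {
    return numLevels > 0;
  }

  /** Reader-side decision: use the graph only if one exists and k is smaller than the segment. */
  boolean useGraphSearch(int k) {
    return hasGraph() && k < size;
  }

  public static void main(String[] args) {
    ToyFieldEntry flat = read(new ToyFieldEntry(50, 0).write());
    ToyFieldEntry graph = read(new ToyFieldEntry(5_000, 3).write());
    System.out.println("flat uses graph: " + flat.useGraphSearch(10));
    System.out.println("graph uses graph: " + graph.useGraphSearch(10));
  }
}
```

Because the decision is driven by the stored value, an index written with the bypass enabled would still be read correctly by an application that has the flag disabled, which is the back-compat concern raised above.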

Bypass HNSW graph building for tiny segments

461a053

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Jul 17, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Jul 17, 2025

github-actions bot added the module:core/codecs label Jul 17, 2025

shubhamvishu mentioned this pull request Jul 17, 2025

Explore bypassing HNSW graph building for tiny segments #13447

Open

benwtrent reviewed Jul 17, 2025

vigyasharma reviewed Jul 20, 2025
