Bypass HNSW graph building for tiny segments #14963
Some minor ideas.
 * When enabled, segments with fewer than the threshold number of vectors will store only flat
 * vectors, significantly improving indexing performance for workloads with frequent flushes.
 */
private final boolean bypassTinySegments;
If we allow this to be a parameter, it should be a threshold that refers to the typical `k` used when querying.
Makes sense
boolean doHnsw =
    knnCollector.k() < scorer.maxOrd()
        && (bypassTinySegments == false
            || fieldEntry.size() > Lucene99HnswVectorsFormat.HNSW_GRAPH_THRESHOLD);
The reader should just look to see if there is a graph.
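A minimal sketch of what that suggestion could look like, assuming the reader can tell whether a graph was written for the segment. The class `SearchStrategySketch`, its `Mode` enum, and the `hasGraph` parameter are hypothetical names for illustration only and are not part of Lucene or of this PR.

```java
/** Hypothetical helper: picks a search mode from what the segment actually contains. */
final class SearchStrategySketch {
  enum Mode { EXACT, GRAPH }

  /**
   * @param k number of neighbors requested by the collector
   * @param vectorCount number of vectors in the segment
   * @param hasGraph whether a graph was written for this segment (assumed to be discoverable)
   */
  static Mode choose(int k, int vectorCount, boolean hasGraph) {
    // No graph on disk, or k covers every vector: scan all vectors exactly.
    if (!hasGraph || k >= vectorCount) {
      return Mode.EXACT;
    }
    return Mode.GRAPH;
  }
}
```

With this shape the reader does not need to know the writer's flag or threshold; it only reacts to the presence of a graph.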
// Check if we need to initialize graph builder for tiny segment optimization
if (bypassTinySegments
    && graphBuilderInitialized == false
    && node >= Lucene99HnswVectorsFormat.HNSW_GRAPH_THRESHOLD) {
I am thinking we should just use the `expectedNodeVisited` logic for the given node count (treating it like it's a "graph") vs. the tiny segment threshold number (which should be used as a `k`).
I like the idea, I'll change the predicate to use that logic
I think 10K is likely way too large 10k vector ops vs. If a user typically searches for 10 nearest neighbors, the graph should be built at around 90 vectors.
I can think of a circumstance where we might create small segments that will probably never get searched at all, but will very quickly be merged. In that case we might want to allow a larger threshold?
As far as the tests are concerned, I'm confused: wouldn't they fail with a high threshold, since we wouldn't build a graph until there are many documents? Maybe I didn't understand the meaning of the threshold though.
@msokolov Actually they do, I had missed setting I'll try if there is some clean way to override those checks in the failing unit tests.
I wonder how this interacts with how
I think that is fine. I am thinking of semi-nrt with lots of updates. In cases like that 10k is way too big a default. I think the value should be used as an input to expectedVisitedNodes that takes into account the potential graph size. Additionally, I would assume users would want to scale quantized formats vs. non-quantized differently (as their vector ops can be much cheaper than floating point ops).
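One way to picture the "use the value as a `k` and feed it to the expected-visited logic" idea is sketched below. Everything here is an assumption for illustration: `GraphBuildPolicySketch`, `shouldBuildGraph`, and especially `LOG_MODEL` (a `k * ln(size)` estimate) are hypothetical and are not confirmed by this thread as the formula Lucene uses. The cost model is passed in as a parameter so that quantized and non-quantized formats could plug in different estimates.

```java
import java.util.function.IntBinaryOperator;

/** Hypothetical policy: build a graph only when it is expected to beat a full scan. */
final class GraphBuildPolicySketch {

  /** Assumed cost model (k, size) -> expected vector comparisons; illustrative only. */
  static final IntBinaryOperator LOG_MODEL = (k, size) -> (int) (Math.log(size) * k);

  /**
   * @param nodeCount number of vectors buffered so far in the segment
   * @param k the configured "typical k" used at query time
   * @param expectedVisited cost model estimating comparisons for a graph search
   */
  static boolean shouldBuildGraph(int nodeCount, int k, IntBinaryOperator expectedVisited) {
    if (nodeCount <= k) {
      return false; // every vector would be returned anyway, a scan is enough
    }
    // A full scan costs nodeCount comparisons; build the graph only if it should visit fewer.
    return expectedVisited.applyAsInt(k, nodeCount) < nodeCount;
  }
}
```

Under this toy model a 20-vector segment queried with `k = 10` stays flat, while a 1000-vector segment gets a graph; the exact crossover depends entirely on the cost model chosen.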
I would hope the format just does the right thing, and searches everything, knowing that there isn't a graph.
@jpountz Ahh, I see what you are pointing towards, and here is what I think we could try:
Let me know your thoughts or if I'm missing something here. Thanks!
My recommendation would be to move the logic of switching to an exact search when the filter is selective to
This makes me wonder if the knn search method should accept a ScorerSupplier and the live docs Bits instead of a fully realized bit set that represents both the filter and live docs....
Or some higher-level abstraction that can either be consumed in a random-access fashion (

class AcceptDocs {
  /** Random access to the accepted documents. */
  Bits getBits();
  /** Get an iterator of accepted docs. */
  DocIdSetIterator getIterator();
  /** Return an approximation of the number of accepted documents. */
  long cost();
}
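To make the shape of that proposal concrete, here is a self-contained sketch of the same three operations. It deliberately does not use Lucene types: `IntPredicate` stands in for `Bits` and `PrimitiveIterator.OfInt` stands in for `DocIdSetIterator`, and the names `AcceptDocsSketch` and `BitSetAcceptDocs` are hypothetical.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for the proposed abstraction, using JDK types only. */
interface AcceptDocsSketch {
  /** Random access to the accepted documents. */
  IntPredicate bits();

  /** Iterator over accepted doc ids in increasing order. */
  PrimitiveIterator.OfInt iterator();

  /** Approximate number of accepted documents (exact in this implementation). */
  long cost();
}

/** One possible backing: a fully realized bit set that supports both access patterns. */
final class BitSetAcceptDocs implements AcceptDocsSketch {
  private final BitSet accepted;

  BitSetAcceptDocs(BitSet accepted) {
    this.accepted = (BitSet) accepted.clone(); // defensive copy
  }

  @Override
  public IntPredicate bits() {
    return accepted::get;
  }

  @Override
  public PrimitiveIterator.OfInt iterator() {
    return accepted.stream().iterator();
  }

  @Override
  public long cost() {
    return accepted.cardinality();
  }
}
```

A caller could compare `cost()` against the expected graph-search cost and pick the iterator for an exact scan or the random-access view for graph traversal.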
How do we handle backward compatibility in this change? I noticed we don't write any metadata (e.g. in `FieldEntry`) about `bypassTinySegments` or whether a graph was built or not. The flag gets configured when the format is initialized from the codec.
What happens if I create an index with `bypassTinySegments=true`, but later read it in an application with the flag set to false? I think we need to persist information about whether a graph was built for the segment.
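A rough illustration of persisting that information, so the reader depends on what was written rather than on its own configuration. This is only a sketch: `FieldMetaSketch` and its layout are hypothetical, and plain JDK data streams stand in for Lucene's index output and input classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical per-field metadata that records whether a graph was built. */
final class FieldMetaSketch {
  final int vectorCount;
  final boolean hasGraph;

  FieldMetaSketch(int vectorCount, boolean hasGraph) {
    this.vectorCount = vectorCount;
    this.hasGraph = hasGraph;
  }

  /** Serializes the metadata; the writer decides hasGraph at flush time. */
  byte[] write() {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(vectorCount);
      out.writeBoolean(hasGraph);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Deserializes the metadata; the reader never consults its own bypass flag. */
  static FieldMetaSketch read(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      return new FieldMetaSketch(in.readInt(), in.readBoolean());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With the flag stored per segment, an index written with the bypass enabled can be read by an application that has it disabled, and vice versa.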
this.bypassTinySegments = bypassTinySegments;
this.flatFieldVectorsWriter = Objects.requireNonNull(flatFieldVectorsWriter);
if (bypassTinySegments) {
  this.bufferedVectors = new ArrayList<>();
Since we only store up to `HNSW_GRAPH_THRESHOLD` vectors, beyond which we resume the regular flow of adding them to the graph, we could use an array here instead of an `ArrayList`?
  replayBufferedVectors();
  bufferedVectors.clear();
}
if (hnswGraphBuilder != null) {
Does `hnswGraphBuilder != null` do the same thing as `graphBuilderInitialized`? If so, do we need `graphBuilderInitialized`?
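The two suggestions above (a fixed-size array for the buffer, and a non-null builder as the only "initialized" signal) could be combined roughly as follows. This is a sketch under assumptions: `LazyGraphWriterSketch` is a hypothetical class, and a plain `List<float[]>` stands in for the real graph builder.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical writer that buffers vectors and creates the graph builder lazily. */
final class LazyGraphWriterSketch {
  private final int threshold;
  private float[][] buffered; // fixed capacity: never holds more than threshold vectors
  private int bufferedCount;
  private List<float[]> graphBuilder; // stand-in builder; non-null means "initialized"

  LazyGraphWriterSketch(int threshold) {
    if (threshold <= 0) {
      throw new IllegalArgumentException("threshold must be positive");
    }
    this.threshold = threshold;
    this.buffered = new float[threshold][];
  }

  void addVector(float[] vector) {
    if (graphBuilder == null && bufferedCount == threshold) {
      // Threshold reached: create the builder and replay everything buffered so far.
      graphBuilder = new ArrayList<>();
      for (int i = 0; i < bufferedCount; i++) {
        graphBuilder.add(buffered[i]);
      }
      buffered = null; // buffer is no longer needed once the graph exists
    }
    if (graphBuilder != null) {
      graphBuilder.add(vector);
    } else {
      buffered[bufferedCount++] = vector;
    }
  }

  /** No separate boolean flag: the builder reference is the single source of truth. */
  boolean hasGraph() {
    return graphBuilder != null;
  }

  int graphSize() {
    return graphBuilder == null ? 0 : graphBuilder.size();
  }
}
```

Keeping one piece of state avoids the two fields drifting apart, and the array makes the buffer's upper bound explicit.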
Maybe we could use one of the existing fields that describe the graph. Like set
Description
This change avoids creating an HNSW graph if the segment is small (for now, the threshold for the number of vectors is `10000`, based on the conversation here).

Some points I'm not sure how we would want to handle:

- The flag is `false` by default, but setting it to `true` reveals some failing unit tests that inherently assume the HNSW graph is created and KNN search is triggered (is there a clean way to bypass those?)

TODOs:
Closes #13447