-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Adjust IVF fixup phase to sometimes bypass some of the neighborhood calculations #130490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust IVF fixup phase to sometimes bypass some of the neighborhood calculations #130490
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
neighborhoods.set(i, neighbors); | ||
float maxIntraDistance = queue.consumeNodesWithWorstScore(neighbors, scores); | ||
// Sort neighbors by their score | ||
for (int j = 0; j < neighborCount; j++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not populate the array using the pop method of the priority queue?
for (int j = neighborCount - 1; j >= 0; j--) {
neighbors[j] = queue.pop();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iverase let me benchmark this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just left a recommendation to simplify the code. The approach makes sense to me I have observed the same behaviour as explained in the description when running it over my local tests.
continue; | ||
} | ||
// consume the queue into the neighbors array and get the maximum intra-cluster distance | ||
int[] neighbors = new int[queue.size()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks much nicer now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and faster!
During the fixup phase, we compare the vector against every neighbor in the cluster neighborhood, no matter what. This seems pretty wasteful, especially for tightly clustered set of neighbors.
This change adjusts the fix up phase to only check for assignment if the currently assigned centroid is worse than the maximum intra-distance of the neighborhood.
This further reduces index time at no perceivable recall loss. I ran over 3 data sets, multi-segment and force merged.
Additionally, I noticed that we seemed to compute neighborhoods and use those calculations even when the total number of clusters is fewer than the neighborhood size. I adjusted this logic and we only compute the neighborhoods when the number of clusters is larger than the configured fixup neighborhood size.
All in all, this gives us about 5-15% index performance boost with no substantial drop in recall (the most I saw across all my runs was 0.01)
@iverase @john-wagster let me know what y'all think