Adjust IVF fixup phase to sometimes bypass some of the neighborhood calculations #130490

Merged
merged 4 commits into from
Jul 3, 2025

benwtrent
Member

During the fixup phase, we compare the vector against every neighbor in the cluster neighborhood, no matter what. This seems pretty wasteful, especially for a tightly clustered set of neighbors.

This change adjusts the fixup phase to check for reassignment only if the vector's distance to its currently assigned centroid is worse (larger) than the maximum intra-distance of the neighborhood.
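
To make the skip condition concrete, here is a minimal sketch of the idea, not the code in this PR. The class `FixupSketch`, its method names, and the array layout are all hypothetical, and it assumes squared Euclidean distance with `maxIntraDistances` stored in the same squared units.

```java
// Hypothetical sketch of the fixup short-circuit; names and layout are assumptions.
final class FixupSketch {

    // Squared Euclidean distance between two vectors of equal length.
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Returns the centroid index the vector should have after fixup.
     * neighborhoods[c] lists the neighbor centroids of c, and
     * maxIntraDistances[c] is the largest distance from c to any of them.
     */
    static int fixup(float[] vector, int assigned, float[][] centroids,
                     int[][] neighborhoods, float[] maxIntraDistances) {
        float best = squaredDistance(vector, centroids[assigned]);
        // Bypass: the current assignment is no worse than the neighborhood's
        // maximum intra-distance, so the neighbor comparisons are skipped.
        if (best <= maxIntraDistances[assigned]) {
            return assigned;
        }
        int bestCentroid = assigned;
        for (int neighbor : neighborhoods[assigned]) {
            float d = squaredDistance(vector, centroids[neighbor]);
            if (d < best) {
                best = d;
                bestCentroid = neighbor;
            }
        }
        return bestCentroid;
    }
}
```

With centroids at `(0, 0)` and `(10, 0)`, a vector at `(1, 0)` assigned to the first centroid takes the bypass, while a vector at `(20, 0)` assigned to the first centroid falls through to the neighbor scan and moves to the second.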

This further reduces index time with no perceivable recall loss. I ran it over 3 data sets, both multi-segment and force-merged.

Additionally, I noticed that we seemed to compute neighborhoods and use those calculations even when the total number of clusters was smaller than the neighborhood size. I adjusted this logic so that we only compute the neighborhoods when the number of clusters is larger than the configured fixup neighborhood size.
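
The guard described above could look roughly like the following sketch. `NeighborhoodSketch` and `computeNeighborhoods` are hypothetical names, the brute-force nearest-centroid search is only there to keep the example self-contained, and returning an empty array when the guard fails is an assumption about how a caller might detect that no neighborhoods were built.

```java
import java.util.Comparator;
import java.util.stream.IntStream;

// Hypothetical sketch: only build neighborhoods when there are more
// clusters than the configured neighborhood size.
final class NeighborhoodSketch {

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Returns, for each centroid, the indices of its nearest
     * neighborhoodSize centroids, or an empty array when the number of
     * centroids does not exceed neighborhoodSize.
     */
    static int[][] computeNeighborhoods(float[][] centroids, int neighborhoodSize) {
        if (centroids.length <= neighborhoodSize) {
            return new int[0][]; // too few clusters: skip the computation entirely
        }
        int[][] result = new int[centroids.length][];
        for (int i = 0; i < centroids.length; i++) {
            final int self = i;
            result[i] = IntStream.range(0, centroids.length)
                .filter(j -> j != self)
                .boxed()
                .sorted(Comparator.comparingDouble(
                    (Integer j) -> squaredDistance(centroids[self], centroids[j])))
                .limit(neighborhoodSize)
                .mapToInt(Integer::intValue)
                .toArray();
        }
        return result;
    }
}
```

What the caller does when no neighborhoods are available (for example, comparing against every centroid) is not covered by this sketch.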

All in all, this gives us about a 5-15% indexing performance boost with no substantial drop in recall (the most I saw across all my runs was 0.01).

@iverase @john-wagster let me know what y'all think

@benwtrent benwtrent requested a review from iverase July 2, 2025 20:00
@benwtrent benwtrent requested a review from john-wagster July 2, 2025 20:00
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 2, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@benwtrent benwtrent changed the title from "Feature/improve ivf index time" to "Adjust IVF fixup phase to sometimes bypass some of the neighborhood calculations" Jul 2, 2025
Contributor

@john-wagster john-wagster left a comment

lgtm

neighborhoods.set(i, neighbors);
float maxIntraDistance = queue.consumeNodesWithWorstScore(neighbors, scores);
// Sort neighbors by their score
for (int j = 0; j < neighborCount; j++) {
Contributor

@iverase iverase Jul 3, 2025

Why not populate the array using the pop method of the priority queue?

            for (int j = neighborCount - 1; j >= 0; j--) {
                neighbors[j] = queue.pop();
            }
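
For context on why filling from the last index works: if the queue keeps the worst (largest-distance) entry on top, each pop yields the next-worst entry, so writing backwards leaves the array in ascending distance order, and the first entry popped carries the maximum intra-distance. A minimal sketch using `java.util.PriorityQueue` rather than the queue class in this PR (`DrainSketch`, `Neighbor`, and `drainSorted` are hypothetical names):

```java
import java.util.PriorityQueue;

// Hypothetical sketch of draining a max-heap into a sorted array.
final class DrainSketch {

    // A centroid index paired with its distance to the centroid being processed.
    static final class Neighbor {
        final int centroid;
        final float distance;

        Neighbor(int centroid, float distance) {
            this.centroid = centroid;
            this.distance = distance;
        }
    }

    // Max-heap: the head is the neighbor with the largest distance.
    static PriorityQueue<Neighbor> newMaxHeap() {
        return new PriorityQueue<>((a, b) -> Float.compare(b.distance, a.distance));
    }

    /**
     * Empties the heap into neighbors (closest first) and returns the
     * largest distance seen. neighbors.length must equal the heap size.
     */
    static float drainSorted(PriorityQueue<Neighbor> maxHeap, int[] neighbors) {
        float maxIntraDistance = maxHeap.isEmpty() ? 0f : maxHeap.peek().distance;
        for (int j = neighbors.length - 1; j >= 0; j--) {
            neighbors[j] = maxHeap.poll().centroid; // worst remaining goes last
        }
        return maxIntraDistance;
    }
}
```

This avoids a separate sort pass because the heap already yields entries in order.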

Member Author

@iverase let me benchmark this

Contributor

@iverase iverase left a comment

Just left a recommendation to simplify the code. The approach makes sense to me; I observed the same behaviour as explained in the description when running it over my local tests.

@benwtrent benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jul 3, 2025
continue;
}
// consume the queue into the neighbors array and get the maximum intra-cluster distance
int[] neighbors = new int[queue.size()];
Contributor

looks much nicer now

Member Author

and faster!

@benwtrent benwtrent merged commit e5da80f into elastic:main Jul 3, 2025
31 of 32 checks passed
@benwtrent benwtrent deleted the feature/improve-ivf-index-time branch July 3, 2025 15:11
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >non-issue :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.2.0
4 participants