Add optimized Neon, AVX2, and AVX 512 float32 vector operations. #130541


Draft: wants to merge 6 commits into base: main
Conversation

ChrisHegarty
Contributor

@ChrisHegarty ChrisHegarty commented Jul 3, 2025

Add optimized Neon, AVX2, and AVX 512 float32 vector operations.

The changes in this PR give an approximately 2x performance increase for float32 vector operations across Linux/Mac AArch64 and Linux x64 (both AVX2 and AVX 512).

The performance increase comes mostly from being able to score the vectors off-heap (rather than copying them on-heap before scoring). The low-level native scorer implementations show only an approximate 3-5% improvement over the existing Panama Vector implementation. However, the native scorers allow scoring off-heap. Using Panama Vector with MemorySegments runs into a performance bug in HotSpot, where the bounds check is not optimally hoisted out of the hot loop (this has been reported to and acknowledged by OpenJDK).
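
To make the off-heap vs. on-heap distinction concrete, here is a minimal sketch, not the PR's code: it contrasts a copy-then-score path with reading the document vector directly from a MemorySegment. The class and method names are illustrative, and a real implementation would replace the plain loops with Neon/AVX2/AVX-512 or Panama Vector code. Assumes Java 21+ (java.lang.foreign).

```java
// Minimal sketch (not the PR's code), assuming Java 21+ (java.lang.foreign).
// Contrasts copying the document vector on-heap before scoring with reading
// it directly from a MemorySegment.
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class OffHeapVsOnHeapSketch {

    // Copy-then-score: materialises a float[] for every scored vector.
    static float dotProductWithCopy(float[] query, MemorySegment docVector, int dims) {
        float[] copy = docVector.toArray(ValueLayout.JAVA_FLOAT);
        float sum = 0f;
        for (int i = 0; i < dims; i++) {
            sum += query[i] * copy[i];
        }
        return sum;
    }

    // Off-heap: reads floats straight out of the segment, no intermediate copy.
    static float dotProductOffHeap(float[] query, MemorySegment docVector, int dims) {
        float sum = 0f;
        for (int i = 0; i < dims; i++) {
            sum += query[i] * docVector.getAtIndex(ValueLayout.JAVA_FLOAT, i);
        }
        return sum;
    }
}
```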

There are two high-level layers to scorers:

  1. The low-level vector operations, which are pure functions: cosine, dot product, and square distance.
  2. The higher-level Lucene scorers, which apply some basic normalisation to the lower-level values (e.g. clamping to non-negative); a rough sketch of this layering follows the list.
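
The sketch below illustrates the two layers; it is not the PR's code, and the (1 + dot) / 2 mapping is shown only as an example normalisation, the exact formula depends on the similarity function in use.

```java
// Rough sketch of the two layers (not the PR's code). Layer 1 is a pure
// operation; Layer 2 maps the raw value into a non-negative score.
final class TwoLayerSketch {

    // Layer 1: pure low-level operation (scalar fallback shown; the PR adds
    // Neon/AVX2/AVX-512 implementations of the same contract).
    static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Layer 2: higher-level scorer, normalising the raw value to non-negative.
    // The (1 + dot) / 2 mapping is an example normalisation only.
    static float dotProductScore(float[] query, float[] doc) {
        return Math.max((1f + dotProduct(query, doc)) / 2f, 0f);
    }
}
```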

There are tests and benchmarks at each level. While somewhat duplicated, they are required to ensure that the correct values are returned at each layer. Additionally, the tests ensure that the optimized low-level vector operations return equivalent values.

@ChrisHegarty ChrisHegarty added :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Jul 3, 2025
@ChrisHegarty ChrisHegarty added test-windows Trigger CI checks on Windows test-arm Pull Requests that should be tested against arm agents labels Jul 3, 2025
@ldematte ldematte self-requested a review July 3, 2025 15:08
@ChrisHegarty
Contributor Author

The micro benchmarks all show an approximately 2x performance improvement in scorer operations on all platforms. For example:

Apple Mac M2, AArch64
Low-level benchmark results. Compare dotProductLuceneWithCopy to dotProductNativeWithNativeSeg; lower is better (average time, ns/op).

Benchmark                                                (size)  Mode  Cnt    Score    Error  Units
JDKVectorFloat32Benchmark.dotProductLucene                 1024  avgt   15   60.448 ±  4.160  ns/op
JDKVectorFloat32Benchmark.dotProductLuceneWithCopy         1024  avgt   15  115.741 ± 11.562  ns/op
JDKVectorFloat32Benchmark.dotProductNativeWithHeapSeg      1024  avgt   15   60.691 ±  4.329  ns/op
JDKVectorFloat32Benchmark.dotProductNativeWithNativeSeg    1024  avgt   15   59.111 ±  0.751  ns/op

Scorer benchmark results. Compare dotProductLuceneQuery to dotProductNativeQuery; bigger is better (throughput, ops/us).

Benchmark                                     (dims)   Mode  Cnt  Score   Error   Units
Float32ScorerBenchmark.dotProductLucene         1024  thrpt    5  3.522 ± 0.025  ops/us
Float32ScorerBenchmark.dotProductLuceneQuery    1024  thrpt    5  3.969 ± 0.110  ops/us
Float32ScorerBenchmark.dotProductNative         1024  thrpt    5  7.772 ± 0.060  ops/us
Float32ScorerBenchmark.dotProductNativeQuery    1024  thrpt    5  8.260 ± 0.123  ops/us
Float32ScorerBenchmark.dotProductScalar         1024  thrpt    5  0.602 ± 0.003  ops/us
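
For orientation, a minimal JMH sketch of this kind of comparison is shown below. It is not the benchmark class from the PR; the class name, benchmark methods, and setup are illustrative only, assuming Java 21+ (java.lang.foreign) and JMH on the classpath.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.concurrent.TimeUnit;
import java.util.random.RandomGenerator;
import org.openjdk.jmh.annotations.*;

// Illustrative JMH benchmark (hypothetical class, not the PR's JDKVectorFloat32Benchmark):
// compares a copy-then-score dot product against one that reads off-heap directly.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class DotProductSketchBenchmark {

    @Param({"1024"})
    int size;

    float[] query;
    MemorySegment docVector;

    @Setup
    public void setup() {
        RandomGenerator rnd = RandomGenerator.getDefault();
        query = new float[size];
        docVector = Arena.ofAuto().allocate((long) size * Float.BYTES, Float.BYTES);
        for (int i = 0; i < size; i++) {
            query[i] = rnd.nextFloat();
            docVector.setAtIndex(ValueLayout.JAVA_FLOAT, i, rnd.nextFloat());
        }
    }

    @Benchmark
    public float dotProductWithCopy() {
        float[] copy = docVector.toArray(ValueLayout.JAVA_FLOAT); // on-heap copy per invocation
        float sum = 0f;
        for (int i = 0; i < size; i++) {
            sum += query[i] * copy[i];
        }
        return sum;
    }

    @Benchmark
    public float dotProductOffHeap() {
        float sum = 0f; // reads the segment directly, no intermediate copy
        for (int i = 0; i < size; i++) {
            sum += query[i] * docVector.getAtIndex(ValueLayout.JAVA_FLOAT, i);
        }
        return sum;
    }
}
```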
