Conversation

benwtrent
Member

This adds off-heap scoring for our scalar quantization.

Opening as DRAFT as I still haven't fully tested out the performance characteristics. Opening early for discussion.
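For context, here is a minimal sketch of what off-heap scoring means in this PR (names and layout are illustrative, not the actual Lucene implementation): instead of first copying each quantized vector into a heap `byte[]`, the scorer reads components directly from a `MemorySegment` backed by the index file.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical sketch of off-heap scoring: compare a heap-resident query
// against a quantized document vector stored off-heap, reading each byte
// straight from the MemorySegment rather than copying it on-heap first.
public class OffHeapDotProduct {
  static int dotProduct(byte[] query, MemorySegment vectors, long offset, int dims) {
    int sum = 0;
    for (int i = 0; i < dims; i++) {
      // read the i-th quantized component directly from the segment
      byte b = vectors.get(ValueLayout.JAVA_BYTE, offset + i);
      sum += query[i] * b;
    }
    return sum;
  }

  public static void main(String[] args) {
    try (Arena arena = Arena.ofConfined()) {
      // pretend this off-heap segment is the mmapped vector data
      MemorySegment seg = arena.allocate(4);
      seg.copyFrom(MemorySegment.ofArray(new byte[] {1, 2, 3, 4}));
      byte[] query = {4, 3, 2, 1};
      System.out.println(dotProduct(query, seg, 0, 4)); // 4+6+6+4 = 20
    }
  }
}
```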

@benwtrent benwtrent added this to the 9.12.0 milestone Jun 17, 2024
@benwtrent
Member Author

Half-byte is showing up as measurably slower with this change.

Candidate:

0.909	 0.54
0.911	 0.58
0.919	 0.88

baseline:

0.909	 0.30
0.911	 0.33
0.919	 0.47

Full-byte is slightly faster

candidate:

0.962	 0.41
0.966	 0.43
0.978	 0.66

baseline:

0.962	 0.47
0.966	 0.48
0.978	 0.73

@msokolov
Contributor

are you reporting indexing times? query times?

@benwtrent
Member Author

are you reporting indexing times? query times?

Query times, single segment, 10k docs of 1024 dims.

@benwtrent
Member Author

Ok, I double-checked, and indeed, half-byte is way slower when reading directly from memory segments instead of reading on heap.
memsegment_vs_baseline.zip

The flamegraphs are wildly different. Much more time is being spent reading from the memory segment and then comparing the vectors.

candidate (this PR):
[flamegraph: candidate]

baseline:

[flamegraph: baseline]
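One reason half-byte scoring is more sensitive to per-read cost: each stored byte packs two 4-bit components, so every off-heap read is followed by mask/shift unpacking before the multiply. A hypothetical sketch (Lucene's actual int4 packing may differ):

```java
// Hypothetical int4 layout: two 4-bit components per byte, low nibble
// first. Each byte read is amplified by two unpack steps, so making the
// read itself slower (e.g. straight from a MemorySegment) hurts more
// than in the full-byte case.
public class Int4Dot {
  static int int4DotProduct(byte[] query, byte[] packedDoc) {
    int sum = 0;
    for (int i = 0; i < packedDoc.length; i++) {
      int lo = packedDoc[i] & 0x0F;         // first component
      int hi = (packedDoc[i] >> 4) & 0x0F;  // second component
      sum += query[2 * i] * lo + query[2 * i + 1] * hi;
    }
    return sum;
  }

  public static void main(String[] args) {
    byte[] packed = {(byte) 0x21, (byte) 0x43}; // components 1, 2, 3, 4
    byte[] query = {1, 1, 1, 1};
    System.out.println(int4DotProduct(query, packed)); // 1+2+3+4 = 10
  }
}
```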

@benwtrent
Member Author

@ChrisHegarty have you seen a significant performance regression with MemorySegments on JDK 22?

Doing some testing, I updated my performance testing for this PR to use JDK 22, and now it is WAY slower (more than 2x), even for full-byte.

For int7, this branch is marginally faster (~20%) with JDK 21, but basically 2x slower on JDK 22.

I wonder if our off-heap scoring for byte vectors also suffers on JDK22. The quantized scorer for int7 is just using those same methods.

@benwtrent
Member Author

To verify it wasn't some weird artifact in my code, I changed it slightly so that my execution path always reads the vectors on-heap and then wraps them in a MemorySegment. Now JDK 22 performs the same as JDK 21 & the current baseline.
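The workaround described above, as a hedged sketch (the segment and sizes are illustrative, not the PR's actual code): bulk-copy the vector on-heap first, then wrap the heap array so downstream scoring code still consumes a `MemorySegment`.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Illustrative workaround: read the vector on-heap, then wrap the heap
// array in a MemorySegment so the rest of the scoring path is unchanged.
public class OnHeapWrap {
  public static void main(String[] args) {
    try (Arena arena = Arena.ofConfined()) {
      // pretend this is the index file's off-heap vector storage
      MemorySegment offHeap = arena.allocate(4);
      offHeap.copyFrom(MemorySegment.ofArray(new byte[] {10, 20, 30, 40}));

      // bulk-copy the vector onto the heap...
      byte[] onHeap = new byte[4];
      MemorySegment.copy(offHeap, ValueLayout.JAVA_BYTE, 0, onHeap, 0, 4);

      // ...then wrap it; reads now hit the heap array, not mapped memory
      MemorySegment wrapped = MemorySegment.ofArray(onHeap);
      System.out.println(wrapped.get(ValueLayout.JAVA_BYTE, 2)); // 30
    }
  }
}
```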

It's weird to me that reading from a memory segment into ByteVector objects would be 2x slower on JDK 22 than on 21.

Regardless, it's already much slower for the int4 case on both JDK 21 & 22.

@ChrisHegarty
Contributor

Regardless, it's already much slower for the int4 case on both JDK 21 & 22.

@benwtrent I was not aware, lemme take a look.

@kaivalnp
Contributor

+1 to this feature

I work on Amazon product search, and in one of our searchers we see a high proportion of HNSW-search CPU cycles being spent copying quantized vectors to heap:

[profiler screenshot]

Perhaps off-heap scoring could help us!

@benwtrent
Member Author

@kaivalnp feel free to take my initial work here and dig in deeper.

I haven't benchmarked it recently on later JVMs to figure out why I was experiencing such a weird slowdown when going off heap :/

@kaivalnp
Contributor

Thanks @benwtrent! I opened #14863

@benwtrent
Member Author

I am gonna close this as work is progressing elsewhere. Also, we should just move to off-heap bulk scoring ;)

@benwtrent benwtrent closed this Aug 14, 2025
