Add a Faiss codec for KNN searches #14178
base: main
Conversation
Description
TODOs
Long-term considerations
Using a C/CPP shared library makes it difficult to debug and profile native code
Usage
The new format can be used by:
For example, creating an HNSW index:
new FaissKnnVectorsFormatProvider("HNSW32", "efConstruction=200");

Adding PQ to this index is as simple as:
new FaissKnnVectorsFormatProvider("HNSW32_PQ50", "efConstruction=200");

Reordering the final results using exact distances is as simple as:
new FaissKnnVectorsFormatProvider("HNSW32_PQ50,RFlat", "efConstruction=200");

..and so on

Benchmarks
I built this PR using Java 22 and benchmarked it using luceneutil, with 300d documents and vectors generated using ./gradlew vector-300 from the luceneutil package.

Lucene:
Faiss:
Corresponding format used:

// efSearch is set as topK + fanout
new FaissKnnVectorsFormatProvider("HNSW32", "efConstruction=200,efSearch=150");

This is a single-segment search with no deletes, but we see ~88% index-time speedup and ~26% search-time speedup! Used a fairly powerful machine.

One other thing to note is that the Faiss C_API does not use vectorized (AVX2 / AVX512 / SVE) instructions -- so we can squeeze some more performance out of it by building optimized versions of the library
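For anyone trying this out, wiring the format into an IndexWriter might look roughly like the following -- a hedged sketch, assuming FaissKnnVectorsFormatProvider behaves as a KnnVectorsFormat (the surrounding codec plumbing is standard Lucene):

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.index.IndexWriterConfig;

// Sketch: override the default codec's KNN vectors format with the Faiss-backed one.
// Note: reading the index back requires this codec name to be resolvable via SPI.
IndexWriterConfig config = new IndexWriterConfig();
config.setCodec(
    new FilterCodec("FaissCodec", Codec.getDefault()) {
      @Override
      public KnnVectorsFormat knnVectorsFormat() {
        return new FaissKnnVectorsFormatProvider("HNSW32", "efConstruction=200");
      }
    });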
Some very interesting numbers @kaivalnp

Almost 10x indexing throughput improvement tells me we are doing something silly in Lucene. Especially since the search time is only about 25% better. The search time numbers make me wonder if the differential is mainly that Lucene reads the floats onto heap. Maybe it can be just as fast by not reading the floating point vectors onto heap and doing memory segment stuff (which gets tricky, but not impossible).

Does the FAISS index need the "flat" vector storage at all? I thought FAISS gave direct access to the vector values based on ordinals? Or do you have to index it in a special way?

I can try to replicate the performance numbers when I can.

One thing that stands out to me is that during merge, all vectors are buffered onto heap, which is pretty dang expensive :/
- Fix javadocs
- Fallback to null if underlying format is not present
Interesting, do we have a Lucene PR that explores it?
Faiss does not need (and does not use) the flat vectors at search time, and it does provide access to underlying vectors based on ordinals -- but these are "reconstructed" vectors and may be lossy in some indexes (for example PQ).

Because of this loss of information, vectors would keep getting more approximate with each merge (where we read back all vectors and create a fresh index) -- which is not desirable.

We could technically store the raw vectors within Faiss (say another flat index) -- but exposing them via a
+1 -- I'd like to reduce memory pressure in the future. One thing @msokolov pointed out offline is that we're using the flat format anyway -- we could flush that first and read back the vectors (but this time disk-backed, so we reduce the double copy in memory). I'm not sure if the current APIs allow it, but I'll try to address this.

The least memory usage would come from adding vectors one-by-one to the Faiss index and not storing them on heap at all, but I suspect this would hurt indexing performance due to multiple native calls (one per document). We could possibly index vectors in "batches" as a middle ground (see the sketch below).

Also, the "train" call requires all training vectors to be passed at once -- so this is another bottleneck (i.e. we need to keep all training vectors in memory)
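To illustrate that middle ground, the writer could buffer vectors on heap in fixed-size chunks and make one native call per chunk -- a minimal sketch, where addBatchToFaissIndex is a hypothetical stand-in for the native add downcall and the batch size is arbitrary:

// Sketch: amortize native-call overhead by adding vectors to Faiss in batches
class BatchingVectorWriter {
  private static final int BATCH_SIZE = 1024;
  private final int dimension;
  private final float[] buffer;
  private int buffered = 0;

  BatchingVectorWriter(int dimension) {
    this.dimension = dimension;
    this.buffer = new float[BATCH_SIZE * dimension];
  }

  void add(float[] vector) {
    System.arraycopy(vector, 0, buffer, buffered * dimension, dimension);
    if (++buffered == BATCH_SIZE) {
      flush();
    }
  }

  void flush() {
    if (buffered > 0) {
      addBatchToFaissIndex(buffer, buffered); // one native call for the whole batch
      buffered = 0;
    }
  }

  private void addBatchToFaissIndex(float[] vectors, int count) {
    // hypothetical: copy 'count * dimension' floats to a MemorySegment and
    // invoke the faiss_Index_add downcall handle
  }
}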
Thanks, this would be super helpful!
I did not test this specific integration, but Faiss is multithreaded on bulk training, adding, and searching, so we have to be careful when comparing the results.
Ah nice catch, the number of threads used by both may be different.. I'm not sure how many threads were used by Faiss above, but the number of threads used by Lucene is specified here (I didn't change these). I set

Lucene:
Faiss:
Not as high as 10x anymore, but it is still ~3x faster
This was set to
Not so easy ;) See the force merge time for Faiss (41.44 s). The force merge time is the time it took to merge the created segments into 1 (the final number of segments for your experiment). So the total time is index time + force merge time, and in these runs it seems in the same ballpark for both.
Ah I see :)
Does it mean that the Faiss benchmark created a larger number of segments initially, which had to be merged into 1? If so, would it mean that we created smaller indexes (i.e. HNSW graphs) for the original segments, and then had to recreate a larger one from scratch during force merge? (because we start fresh during merges)

I wonder if the numbers are comparable in this case, I'll try to dig deeper on the pre-merge segment count mismatch between the two..
@kaivalnp the force-merge time indicates that during merge to a single segment, the index is being rebuilt from various segments. I would think that the

For this benchmark there are a couple of options:
One other concern: there are two types of "multi-threading" in the indexing. There are the number of threads doing the indexing (e.g. writing to an indexer and creating a segment) and the number of threads used when building a graph.

For simplicity, I would reduce the number of indexing threads to 1, Faiss threads to 1, and merge workers to 1 (see the sketch below). Once we have numbers for the cost of running on a single thread, then we can see how adjusting these allows one to pull ahead of the other.
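On the Lucene side, something like this pins merging to a single thread (a sketch using standard Lucene APIs; indexing with a single application thread and capping Faiss's OpenMP threads are separate knobs):

import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

// Sketch: at most one concurrent merge, run on one merge thread
IndexWriterConfig config = new IndexWriterConfig();
ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
scheduler.setMaxMergesAndThreads(1, 1);
config.setMergeScheduler(scheduler);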
Really,
@mikemccand that would help these higher level multithreaded performance things a ton. Though, one remaining issue with this benchmark is that FAISS has its own native multi-threading that can throw even more wrenches into the measurement! If I have learned one thing over the years, it's that benchmarking accurately is very difficult!
@benwtrent Thanks for the input! I tried what you mentioned above:
Lucene:
Faiss:
..and the results are surprisingly similar! (ran it twice just to confirm)
I was worried that we had some serious outstanding performance bug that had been missed in Lucene! Conceptually, it makes sense that the performance of building the index is similar, as the main cost of building the index is searching the index.

FAISS with this vector dimension does seem about 20% faster at search. I wonder if there is a way to get the number of vector operations that FAISS does during search. We can then make sure it's simply due to them having faster vector ops or lower graph-searching overhead.
By this, I mean the number of vectors it must visit when searching the graph.
I should add here that Lucene was using vectorized instructions via Panama, but the C_API of Faiss was not..

Lucene:
Faiss:
..and we do see slightly faster indexing times
Faiss has an
This PR looks very promising. I did a high-level pass on the PR and added some of my thoughts. I will check out the code locally and go through it in more detail. Benchmarks look promising here.
this.rawVectorsFormat =
    new Lucene99FlatVectorsFormat(FlatVectorScorerUtil.getLucene99FlatVectorsScorer());
Since Faiss already stores this information in the index, what is the reason to have a RawFlatVectorsFormat?
Not all Faiss indexes store the original vectors (eg PQ) -- and trying to "reconstruct" vectors may be lossy..
This primarily affects merges, where vectors in smaller segments are read back to create a fresh one -- so we'd keep losing information with each merge
In the ideal scenario we could use Faiss if the index stores full vectors (eg HNSWFlat), and only add raw vectors to Lucene for other indexes. For now I wasn't 100% sure on how to determine this, so I'm storing them in Lucene in all cases
Note that these would only be loaded into memory during indexing, and not search (they aren't accessed by Faiss)
Another point to note is that reading back vectors from Faiss has its own cost (latency if we read them one-by-one, or memory in case of a bulk read) -- so we may need some sort of batching
@SuppressWarnings("unchecked")
public KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo) throws IOException {
  return switch (fieldInfo.getVectorEncoding()) {
    case BYTE -> throw new UnsupportedOperationException("Byte vectors not supported");
Is this something which will be supported in upcoming PRs? The way Faiss supports byte vectors, at least, is that you pass a float[] to Faiss whose values are within the range of a byte, and then use the Faiss encoders to create a byte-vector index. Ref: opensearch-project/k-NN#2425 -- OpenSearch is already looking to add this support. :) Happy to chat on how we can enable it here.
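To make the hand-off concrete, a minimal sketch (names illustrative): byte values are widened losslessly into the float[] form Faiss consumes, and the byte encoding then happens inside Faiss:

// Sketch: every byte value is exactly representable as a float, so nothing is lost here
static float[] widenToFloats(byte[] byteVector) {
  float[] floats = new float[byteVector.length];
  for (int i = 0; i < byteVector.length; i++) {
    floats[i] = byteVector[i];
  }
  return floats;
}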
Ah nice! I wasn't aware of this -- added TODOs for now
// TODO: Non FS-based approach?
Path tempFile = Files.createTempFile(NAME, fieldInfo.name);
This is a limitation with Faiss, where it requires a file path for writing the index. The OpenSearch k-NN plugin removed this limitation in a recent version of OpenSearch, ref: opensearch-project/k-NN#2033. Faiss provides an IOInterface which can be used to read and write the data on a stream. If this is something planned in the next PRs, then feel free to ignore the comment.
+1, I'd like to avoid the FS-based reading / writing too..

I saw these reader and writer structs, where we could pass IndexInput::readBytes and IndexOutput::writeBytes as function stubs for the C process to call (see the Panama sketch below).

My concern is that (1) it may make indexing more expensive than the C process doing all IO internally, and (2) the C API does not provide methods to create and pass these around, so I'm not sure if we can make these structs purely from native C or CPP functions (I'd strongly like to avoid glue code to wrap these) -- but I'll try to dig deeper
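For reference, passing a Java method to native code as a function pointer with Panama looks roughly like this -- a sketch under assumptions: the (address, length) -> int signature is made up for illustration and does not match Faiss's actual reader / writer struct layout:

import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class WriterUpcallSketch {
  // hypothetical callback: native code hands us a buffer to persist (e.g. via IndexOutput)
  static int write(MemorySegment data, long size) {
    // ...copy 'size' bytes out of 'data' into an IndexOutput...
    return 0; // success
  }

  public static void main(String[] args) throws Throwable {
    MethodHandle target =
        MethodHandles.lookup()
            .findStatic(
                WriterUpcallSketch.class,
                "write",
                MethodType.methodType(int.class, MemorySegment.class, long.class));

    FunctionDescriptor descriptor =
        FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_LONG);

    try (Arena arena = Arena.ofConfined()) {
      // 'stub' is a native function pointer that calls back into write(...)
      MemorySegment stub = Linker.nativeLinker().upcallStub(target, descriptor, arena);
      // ...store 'stub' into the C struct field for the native side to invoke...
    }
  }
}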
// Create an index
MemorySegment pointer = temp.allocate(ADDRESS);
callAndHandleError(INDEX_FACTORY, pointer, dimension, temp.allocateFrom(description), metric);
There is an index in Faiss called IndexIDMap that can wrap any Faiss index, which can be used here to avoid having another ordToDoc array. Faiss internally maintains that mapping and will only give you, as results, the docIds you passed during construction. See if we want to use that. FYI, OpenSearch actually uses that. :)
I see that you have already mentioned it in your description here: "Currently the ordinal -> doc id mapping is stored in Lucene and looked up at the end (can be done in Faiss using an IDMap, needs some investigation)". Please ignore this comment.
Yeah, I was trying this offline -- added in the latest commit! Performance looks largely unchanged
Lucene:
recall latency (ms) nDoc topK fanout maxConn beamWidth quantized index s index docs/s force merge s num segments index size (MB) vec disk (MB) vec RAM (MB)
0.812 1.364 200000 100 50 32 200 no 147.53 1355.65 0.01 1 236.93 228.882 228.882
Faiss:
recall latency (ms) nDoc topK fanout maxConn beamWidth quantized index s index docs/s force merge s num segments index size (MB) vec disk (MB) vec RAM (MB)
0.811 1.091 200000 100 50 32 200 no 144.92 1380.07 0.00 1 511.97 228.882 228.882
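For reference, the add-time change boils down to handing Faiss the doc ids directly. A hedged sketch of that call through the C API -- ADD_WITH_IDS is a hypothetical downcall handle for faiss_Index_add_with_ids (declared in c_api/Index_c.h), mirroring the INDEX_FACTORY pattern above; flatVectors, ids, and indexPointer are illustrative names:

// Sketch: pass doc ids at add time so Faiss returns them directly at search time
try (Arena arena = Arena.ofConfined()) {
  MemorySegment vectors = arena.allocateFrom(ValueLayout.JAVA_FLOAT, flatVectors); // n * d floats
  MemorySegment docIds = arena.allocateFrom(ValueLayout.JAVA_LONG, ids); // n ids
  // C signature: int faiss_Index_add_with_ids(FaissIndex*, idx_t n, const float* x, const idx_t* xids)
  callAndHandleError(ADD_WITH_IDS, indexPointer, numVectors, vectors, docIds);
}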
Amen to that!!
Indeed! I hope we can instrument / pull CPU counters correctly so that we account for the private threadpool FAISS is spawning.
// TODO: This is like a post-filter, include at runtime?
int doc = ordToDoc[ord];
if (acceptDocs == null || acceptDocs.get(doc)) {
  knnCollector.collect(doc, distances[i]);
}
Faiss has support for filtering during search. You can pass the accepted docIds via IndexIDMap: https://github.com/facebookresearch/faiss/blob/main/faiss/IndexIDMap.cpp#L130 and for HNSW: https://github.com/facebookresearch/faiss/blob/main/faiss/impl/HNSW.cpp#L923

The ids on which the filter needs to be applied can be passed with the IDSelector interface: https://github.com/facebookresearch/faiss/blob/main/faiss/impl/IDSelector.cpp. A sample code from OpenSearch: https://github.com/opensearch-project/k-NN/blob/2e603a1b47fdda3444625ca1c1ce7c34c62d510a/jni/src/faiss_wrapper.cpp#L589-L596
Yeah, I came across this -- an IDSelectorBitmap appears to be the most relevant for us..

Lucene's liveDocs are generally present as a FixedBits (and acceptDocs as a FixedBitSet in case of a filtered search) -- both of which have an underlying long[]

The IDSelectorBitmap expects an array of bytes, so we could dump Lucene's long[] in LITTLE_ENDIAN byte order (least-significant byte of each long first, matching the bitmap[i >> 3] & (1 << (i & 7)) test in Faiss) to the C process and directly use it as a byte array! (see the sketch below)

One problem is that the C_API of Faiss does not support creating an IDSelectorBitmap directly, so we may need:
- Changes to Faiss to expose this class from the C_API (good-to-have in the long term)
- Figure out the C-equivalent struct layout of IDSelectorBitmap and directly allocate it from Panama (minimally invasive)
I'll take a pass at this soon
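A minimal sketch of that dump, assuming acceptDocs is a FixedBitSet (FixedBitSet.getBits() exposes the underlying long[]):

// Sketch: flatten FixedBitSet words into the byte layout IDSelectorBitmap reads --
// bit i must land at bitmap[i >> 3], bit (i & 7), i.e. byte j holds bits 8j..8j+7,
// which is each long's least-significant byte first
long[] words = fixedBitSet.getBits();
byte[] bitmap = new byte[words.length * Long.BYTES];
for (int i = 0; i < bitmap.length; i++) {
  bitmap[i] = (byte) (words[i >>> 3] >>> ((i & 7) << 3));
}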
I created facebookresearch/faiss#4158 for (1)
- Create an index using `add_with_ids` instead of `add`
- Remove `ordToDoc` from both indexing and search flows
- Some misc changes and refactoring
Description
Faiss (https://github.com/facebookresearch/faiss) is "a library for efficient similarity search and clustering of dense vectors"
It supports various features like vector transforms (eg PCA), indexing algorithms (eg IVF, HNSW, etc), quantization techniques (eg PQ), search strategies (eg 2-step refinement), different hardware (including GPUs) -- all through a convenient and flexible Index Factory (https://github.com/facebookresearch/faiss/wiki/The-index-factory)
Proposing to add a wrapper to Lucene (via a new sandboxed KnnVectorsFormat) to create and search vector indexes with Faiss. OpenSearch has a similar feature, but that is implemented using JNI, which has its own overhead (need for "glue" code, separate build systems).

This PR aims to have a pure Java implementation using the Panama (https://openjdk.org/projects/panama) Foreign Function Interface (FFI) to interact with the library. Faiss provides a nice C API "to produce bindings for programming languages with Foreign Function Interface (FFI) support"
This PR does not aim to add Faiss as a dependency of Lucene, but requires the user to build the C API (https://github.com/facebookresearch/faiss/blob/main/c_api/INSTALL.md) and put the built shared library (libfaiss_c.so) along with all dependencies (like OpenBLAS) on the Java library path (either the -Djava.library.path JVM argument or the $LD_LIBRARY_PATH environment variable).

More details and considerations to follow, but opening this PR to get feedback on the need, implementation, and long-term support for such a codec (i.e. can we keep this codec in Lucene)
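For the curious, locating the library and a symbol at runtime with Panama looks roughly like this (a sketch -- the PR's actual loading code may differ):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;

try (Arena arena = Arena.ofConfined()) {
  // resolves libfaiss_c.so (or the platform equivalent) from the library search path
  SymbolLookup faiss = SymbolLookup.libraryLookup(System.mapLibraryName("faiss_c"), arena);

  // faiss_index_factory is declared by the C_API (c_api/index_factory_c.h)
  MemorySegment indexFactory = faiss.find("faiss_index_factory").orElseThrow();

  // ...downcall handles are then built via Linker.nativeLinker().downcallHandle(...)...
}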