
Add a Faiss codec for KNN searches #14178

Open · wants to merge 3 commits into main

Conversation

kaivalnp
Contributor

Description

Faiss (https://github.com/facebookresearch/faiss) is "a library for efficient similarity search and clustering of dense vectors"

It supports various features like vector transforms (e.g. PCA), indexing algorithms (e.g. IVF, HNSW, etc.), quantization techniques (e.g. PQ), search strategies (e.g. two-step refinement), and different hardware (including GPUs) -- all through a convenient and flexible Index Factory (https://github.com/facebookresearch/faiss/wiki/The-index-factory)

Proposing to add a wrapper to Lucene (via a new sandboxed KnnVectorsFormat) to create and search vector indexes with Faiss. OpenSearch has a similar feature, but it is implemented using JNI, which has its own overhead (the need for "glue" code, separate build systems)

This PR aims to have a pure Java implementation using the Panama (https://openjdk.org/projects/panama) Foreign Function Interface (FFI) to interact with the library. Faiss provides a nice C API "to produce bindings for programming languages with Foreign Function Interface (FFI) support"

This PR does not aim to add Faiss as a dependency of Lucene, but requires the user to build the C API (https://github.com/facebookresearch/faiss/blob/main/c_api/INSTALL.md) and put the built shared library (libfaiss_c.so) along with all its dependencies (like OpenBLAS) on the Java library path (either via the -Djava.library.path JVM argument or the $LD_LIBRARY_PATH environment variable)
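For reference, the binding layer is plain Panama. A minimal sketch (not this PR's actual code) of the downcall setup for the C API's faiss_index_factory, assuming libfaiss_c.so is resolvable at runtime:

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

class FaissBindingSketch {
  static final Linker LINKER = Linker.nativeLinker();

  // resolved at runtime via the library path, as described above
  static final SymbolLookup FAISS = SymbolLookup.libraryLookup("libfaiss_c.so", Arena.global());

  // int faiss_index_factory(FaissIndex** p_index, int d, const char* description, FaissMetricType metric)
  static final MethodHandle INDEX_FACTORY =
      LINKER.downcallHandle(
          FAISS.find("faiss_index_factory").orElseThrow(),
          FunctionDescriptor.of(
              ValueLayout.JAVA_INT,    // error code (non-zero on failure)
              ValueLayout.ADDRESS,     // FaissIndex** (out-param)
              ValueLayout.JAVA_INT,    // vector dimension
              ValueLayout.ADDRESS,     // index-factory description string
              ValueLayout.JAVA_INT));  // metric type
}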

More details and considerations to follow, but opening this PR to get feedback on the need, implementation, and long-term support for such a codec (i.e. can we keep this codec in Lucene)

@kaivalnp
Contributor Author

Implementation details

  1. Separate Faiss indexes are maintained per-segment per-field, in line with Lucene's architecture (and the current vector format)
  2. Vectors are buffered in memory until flush, copied over to the native process, and added to the Faiss index using a single bulk add
  3. Different Faiss indexes (one for each field) are concatenated and stored in a single data file .faissd at flush time, and corresponding metadata information is stored in a separate file .faissm
  4. At read time, temp files are created for the separate Faiss indexes (one for each field) based on offsets stored in the meta file, read into memory, and deleted thereafter
  5. At search time, the query vector is copied over to the native process, a native search is performed, and results are copied back to Java
  6. Currently the ordinal -> doc id mapping is stored in Lucene and looked up at the end (can be done in Faiss using an IDMap, needs some investigation)

TODOs

  1. RAM and disk usage
    • Faiss is RAM heavy and explicitly loads most indexes into memory (as opposed to the current Lucene implementation which keeps vectors on disk, and reads them via MMAP)
    • The current state of the PR biases towards performance over memory and disk usage (e.g. indexing all docs together instead of in batches, creating temp files on disk instead of using IO wrappers) and can be tweaked for a more balanced trade-off between performance and memory / disk usage
    • Also lacks accurate RAM usage tracking of the Faiss library
  2. Live docs as a search-time filter
    • The current state of the PR removes deleted docs as a post-filter instead of considering them during graph search time
    • These live docs are generally present as a BitSet (with an underlying long[]) and could be copied over to Faiss, which supports a filtered search (an IDSelectorBitmap may be ideal here, but is not currently exposed via the C API)
    • This would also need storing doc ids directly in Faiss (using an IDMap) as opposed to the ordinals
  3. More control over training
    • Some indexes in Faiss (like PQ) require training before they can be used (to understand the document space, and create internal data structures)
    • The current state of the PR simply uses all vectors for training to bias towards higher search-time performance over indexing-time, and we may need to expose more configurability here
  4. Use more specialised native functions
    • For example a native Faiss index merge during Lucene segment merges, but this has its own considerations (like deleted docs, changing doc ids, etc)
  5. Double storage of vectors
    • Some Faiss indexes are unable to reconstruct full-precision vectors once added
    • This would mean a loss of information with each merge, which is undesirable -- so we store the original vectors in Lucene as well
    • These vectors would increase disk usage, but not necessarily RAM as long as they are not accessed

Long-term considerations

Using a C/C++ shared library makes it difficult to debug and profile native code.
Handled exceptions in Faiss are gracefully rethrown in Lucene, but unhandled signals or bugs (like segmentation faults) cannot be recovered from, and will kill the Java process!

@kaivalnp
Contributor Author

Usage

The new format can be used by plugging it into the codec as a per-field KnnVectorsFormat:
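A hedged sketch of the wiring -- the per-field override hook is standard Lucene (shown here on Lucene101Codec), but treating FaissKnnVectorsFormatProvider as the KnnVectorsFormat returned there is my assumption:

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.index.IndexWriterConfig;

// route vector fields to the sandboxed Faiss format via the standard per-field hook
IndexWriterConfig config =
    new IndexWriterConfig()
        .setCodec(
            new Lucene101Codec() {
              @Override
              public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                // assumption: the provider acts as the per-field KnnVectorsFormat
                return new FaissKnnVectorsFormatProvider("HNSW32", "efConstruction=200");
              }
            });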

For example, creating an HNSW index with maxConn=32 and beamWidth=200 is as simple as:

new FaissKnnVectorsFormatProvider("HNSW32", "efConstruction=200");

Adding PQ to this index is as simple as:

new FaissKnnVectorsFormatProvider("HNSW32_PQ50", "efConstruction=200");

Reordering the final results using exact distances is as simple as:

new FaissKnnVectorsFormatProvider("HNSW32_PQ50,RFlat", "efConstruction=200");

..and so on

Benchmarks

I built this PR using Java 22 and benchmarked it using knnPerfTest (this needed some small changes to add the sandbox JAR file to the classpath, and the built Faiss shared library with its dependencies at runtime)

The benchmark uses 300-dimensional document and query vectors generated using:

./gradlew vector-300

from the luceneutil package

Lucene:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.811         1.482  200000   100      50       32        200         no    52.65       3798.38           0.00             1           237.77        228.882       228.882

Faiss:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.809         1.101  200000   100      50       32        200         no     6.06      33030.55          11.05             1           511.22        228.882       228.882

Corresponding format used:

// efSearch is set as topK + fanout
new FaissKnnVectorsFormatProvider("HNSW32", "efConstruction=200,efSearch=150");

This is a single-segment search with no deletes, but we see ~88% lower indexing time and ~26% lower search latency!

Used a fairly powerful machine (m5.12xlarge), with default numMergeWorker and numMergeThread values

One other thing to note is that the Faiss C_API does not use vectorized (AVX2 / AVX512 / SVE) instructions -- so we can squeeze some more performance out of it by building optimized versions of the library

@benwtrent
Member

Some very interesting numbers @kaivalnp

Almost 10x indexing throughput improvement tells me we are doing something silly in Lucene. Especially since the search time is only about 25% better.

The search-time numbers make me wonder if the differential is mainly that Lucene reads the floats onto the heap. Maybe it can be just as fast by not reading the floating-point vectors onto the heap and doing memory-segment stuff (which gets tricky, but not impossible).

Does the FAISS index need the "flat" vector storage at all? I thought FAISS gave direct access to the vector values based on ordinals? Or do you have to index it in a special way?

I can try to replicate the performance numbers when I can.

One thing that stands out to me is that during merge, all vectors are buffered onto the heap, which is pretty dang expensive :/

- Fix javadocs
- Fallback to null if underlying format is not present
@kaivalnp
Contributor Author

Maybe it can be just as fast by not reading the floating-point vectors onto the heap and doing memory-segment stuff

Interesting, do we have a Lucene PR that explores it?

Does the FAISS index need the "flat" vector storage at all? I thought FAISS gave direct access to the vector values based on ordinals?

Faiss does not need (and does not use) the flat vectors at search time, and it does provide access to underlying vectors based on ordinals -- but these are "reconstructed" vectors and may be lossy in some indexes (for example PQ)

Because of this loss of information, vectors would keep getting more approximate with each merge (where we read back all vectors and create a fresh index) -- which is not desirable

We could technically store the raw vectors within Faiss (say another flat index) -- but exposing them via a FloatVectorValues would require multiple native "reconstruct" calls. It would be similar storage-wise, so I just went with one of Lucene's flat formats (which also provides higher control over memory-mapping)
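To make the read-back cost concrete, a rough sketch of a per-ordinal read (reconstructHandle is an assumed downcall handle over the C API's faiss_Index_reconstruct) -- one native call per vector, which is fine for spot reads but slow for a full FloatVectorValues scan:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

class ReconstructSketch {
  // int faiss_Index_reconstruct(const FaissIndex* index, idx_t key, float* recons)
  static float[] reconstruct(
      MemorySegment index, long ord, int dimension, MethodHandle reconstructHandle)
      throws Throwable {
    try (Arena arena = Arena.ofConfined()) {
      MemorySegment out = arena.allocate(ValueLayout.JAVA_FLOAT, dimension);
      int error = (int) reconstructHandle.invokeExact(index, ord, out);
      if (error != 0) {
        throw new IllegalStateException("faiss_Index_reconstruct failed: " + error);
      }
      return out.toArray(ValueLayout.JAVA_FLOAT);
    }
  }
}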

all vectors are buffered onto the heap, which is pretty dang expensive

+1 -- I'd like to reduce memory pressure in the future. One thing @msokolov pointed out offline is that we're using the flat format anyway -- we could flush that first and read the vectors back (but this time disk-backed -- reducing the double copy in memory). I'm not sure if the current APIs allow it, but I'll try to address this

The least memory usage would come from adding vectors one-by-one to the Faiss index and not storing them on the heap at all, but I suspect this would hurt indexing performance due to multiple native calls (one per document). We could possibly index vectors in "batches" as a middle ground (rough sketch below)
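A sketch of such batching (addHandle is an assumed downcall handle over the C API's faiss_Index_add; the batch size is arbitrary):

import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

class BatchAddSketch {
  // int faiss_Index_add(FaissIndex* index, idx_t n, const float* x)
  static void addBatch(
      MemorySegment index, float[] flatVectors, int dimension, MethodHandle addHandle)
      throws Throwable {
    try (Arena arena = Arena.ofConfined()) {
      // flatten a batch of vectors and hand them to Faiss in one native call
      MemorySegment vectors = arena.allocateFrom(ValueLayout.JAVA_FLOAT, flatVectors);
      long count = flatVectors.length / dimension;
      int error = (int) addHandle.invokeExact(index, count, vectors);
      if (error != 0) {
        throw new IOException("faiss_Index_add failed: " + error);
      }
    }
  }
}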

Also, the "train" call requires all training vectors to be passed at once -- so this is another bottleneck (i.e. we need to keep all training vectors in memory)

I can try to replicate the performance numbers

Thanks, this would be super helpful!

@jimczi
Contributor

jimczi commented Jan 29, 2025

Almost 10x indexing throughput improvement tells me we are doing something silly in Lucene.

I did not test this specific integration, but Faiss is multithreaded for bulk training, adding, and searching, so we have to be careful when comparing the results.
The benchmark results show a single segment, but only Faiss reports time spent in the force merge.
Does that mean that you indexed with -numIndexThreads=1 for the Lucene run? Since Faiss uses multithreading by default, we cannot compare with Lucene if we don't use multi-threaded indexing and merging.

@kaivalnp
Contributor Author

Since Faiss uses multithreading by default, we cannot compare with Lucene

Ah nice catch, the number of threads used by both may be different..

I'm not sure how many threads were used by Faiss above, but the number of threads used by Lucene is specified here (I didn't change these)

I set $OMP_NUM_THREADS=4 (from the link you sent) to keep the number of threads the same in both:

Lucene:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.811         1.439  200000   100      50       32        200         no    51.98       3847.63           0.00             1           237.75        228.882       228.882

Faiss:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.810         1.110  200000   100      50       32        200         no    15.92      12565.18          41.44             1           511.21        228.882       228.882

Not as high as 10x anymore, but it is still ~3x faster

Does that mean that you indexed with -numIndexThreads=1 for the Lucene run?

This was set to 8 for both runs (I didn't change the default value)

@jimczi
Contributor

jimczi commented Jan 29, 2025

Not as high as 10x anymore, but it is still ~3x faster

Not so easy ;) See the force-merge time for Faiss (41.44 s). The force merge is the time it took to merge the created segments into 1 (the final number of segments for your experiment). So the total time is index time + force-merge time, and in these runs it seems to be in the same ballpark for both.
I don't understand why the force-merge time is 0 for the Lucene version though.

@kaivalnp
Contributor Author

Ah I see :)

The force merge is the time it took to merge the created segments into 1

Does it mean that the Faiss benchmark created a larger number of segments initially, which had to be merged into 1?

If so, would it mean that we created smaller indexes (i.e. HNSW graphs) for the original segments, and then had to recreate a larger one from scratch during force merge? (because we start fresh during merges)

I wonder if the numbers are comparable in this case, I'll try to dig deeper on the pre-merge segment count mismatch between the two..

@benwtrent
Member

@kaivalnp the force-merge time indicates that during merge to a single segment, the index is being rebuilt from various segments. I would think that the force-merge time itself is more indicative of the cost of indexing than just the initial index phase.

For this benchmark there are a couple of options:

  • Increase the KnnIndexer buffer size to allow 2x of the float vectors in memory (thus keeping a flush from occurring until the graph is ready to be built) and remove the "force-merge" option completely. This will also allow a single segment to be created. Just make sure you have enough heap allocated.
  • Simply sum the two numbers together.
  • Don't forcemerge at all and just accept multiple segments.

One other concern: there are two types of "multi-threading" in the indexing. There are the number of threads doing the indexing (e.g. writing to an indexer and creating a segment) and the number of threads used when building a graph. For simplicity, I would reduce the number of indexing threads to 1, Faiss threads to 1, and merge workers to 1. Once we have numbers for the cost of running on a single thread, we can see how adjusting these allows one to pull ahead of the other.

@mikemccand
Member

Really, luceneutil should report total CPU cycles consumed during indexing and searching (summed across all threads)... I'll open an issue for this.

@benwtrent
Member

should report total CPU cycles consumed during indexing and searching (summed across all threads)...

@mikemccand that would help these higher-level multithreaded performance questions a ton. Though, one remaining issue with this benchmark is that FAISS has its own native multi-threading that can throw even more wrenches into the measurement!

If I have learned one thing over the years, it's that benchmarking accurately is very difficult!

@kaivalnp
Contributor Author

kaivalnp commented Jan 29, 2025

@benwtrent Thanks for the input! I tried what you mentioned above:

Increase the KnnIndexer buffer size to allow 2x of the float vectors in memory
I would reduce the number of indexing threads to 1, faiss threads to 1, and merge workers to 1

Lucene:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.812         1.405  200000   100      50       32        200         no   146.34       1366.70           0.01             1           236.93        228.882       228.882

Faiss:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.811         1.106  200000   100      50       32        200         no   146.58       1364.41           0.01             1           511.20        228.882       228.882

..and the results are surprisingly similar! (ran it twice just to confirm)

@benwtrent
Member

@kaivalnp 😌

I was worried that we had some serious outstanding performance bug that has been missed in Lucene!

Conceptually, it makes sense that the performance of building the index is similar, as the main cost of building the index is searching it.

FAISS with this vector dimension does seem about 20% faster at search. I wonder if there is a way to get the number of vector operations that FAISS does during search. We can then make sure it's simply due to them having faster vector ops or lower graph-searching overhead.

@benwtrent
Member

number of vector operations that FAISS does during search.

By this, I mean the number of vectors it must visit when searching the graph.

@kaivalnp
Contributor Author

FAISS with this vector dimension does seem about 20% faster at search

I should add here that Lucene was using vectorized instructions via Panama, but the C_API of Faiss was not..
I tweaked the offline build to use AVX512 instructions from Faiss as well (basically linking to libfaiss_avx512.so instead of libfaiss.so):

Lucene:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.812         1.424  200000   100      50       32        200         no   145.30       1376.49           0.01             1           236.93        228.882       228.882

Faiss:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.811         1.127  200000   100      50       32        200         no   129.18       1548.20           0.01             1           511.20        228.882       228.882

..and we do see slightly faster indexing times

number of vectors it must visit when searching the graph

Faiss has an HNSWStats struct exposed via a global variable -- I'll try to access this from Java somehow

@navneet1v (Contributor) left a comment

This PR looks very promising. I did a high-level pass and added some of my thoughts; I'll check out the code locally and go through it in more detail. The benchmarks look promising here.

Comment on lines +50 to +51
this.rawVectorsFormat =
    new Lucene99FlatVectorsFormat(FlatVectorScorerUtil.getLucene99FlatVectorsScorer());
Contributor

Since Faiss already stores this information in the index, what is the reason to have a RawFlatVectorsFormat?

Contributor Author

Not all Faiss indexes store the original vectors (eg PQ) -- and trying to "reconstruct" vectors may be lossy..

This primarily affects merges, where vectors in smaller segments are read back to create a fresh one -- so we'd keep losing information with each merge

In the ideal scenario we could rely on Faiss when the index stores full vectors (eg HNSWFlat), and only add raw vectors to Lucene for other indexes. For now I wasn't 100% sure how to determine this, so I'm storing them in Lucene in all cases

Note that these would only be loaded into memory during indexing, and not search (they aren't accessed by Faiss)

Another point to note is that reading back vectors from Faiss has its own cost (latency if we read them one-by-one, or memory in case of a bulk read and we may need some sort of batching)

@SuppressWarnings("unchecked")
public KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo) throws IOException {
  return switch (fieldInfo.getVectorEncoding()) {
    case BYTE -> throw new UnsupportedOperationException("Byte vectors not supported");
Contributor

Is this something that will be supported in upcoming PRs? The way Faiss supports byte vectors, at least, is that you pass a float[] whose values are within the byte range, and then use the Faiss encoders to create a byte-vector index. Ref: opensearch-project/k-NN#2425 -- OpenSearch is already looking to add this support. :) Happy to chat about how we can enable it here.

Contributor Author

Ah nice! I wasn't aware of this -- added TODOs for now

Comment on lines +134 to +135
// TODO: Non FS-based approach?
Path tempFile = Files.createTempFile(NAME, fieldInfo.name);
Contributor

This is a limitation with Faiss, where it requires a file path for writing the index. The OpenSearch k-NN plugin removed this limitation in a recent version (ref: opensearch-project/k-NN#2033): Faiss provides an IO interface which can be used to read and write the data on a stream. If this is something planned for upcoming PRs, feel free to ignore this comment.

Contributor Author

+1, I'd like to avoid the FS based reading / writing too..

I saw these reader and writer structs, where we could pass the IndexInput::readBytes and IndexOutput::writeBytes as function stubs for the C process to call

My concern is that (1) it may make indexing more expensive than the C process doing all IO internally, and (2) the C API does not provide methods to create and pass these structs around, so I'm not sure if we can make them purely from native C or C++ functions (I'd strongly like to avoid glue code to wrap these) -- but I'll try to dig deeper
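For the record, Panama can already expose a Java callback as a native function pointer via an upcall stub. A hypothetical sketch (the callback signature below is made up -- as noted, the C API does not expose such a hook today):

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;

class ReaderStubSketch {
  // hypothetical callback shape: long read(void* buf, long size), returning bytes read
  static long readBytes(MemorySegment buf, long size) {
    // buf arrives as a zero-length segment; buf.reinterpret(size) before copying
    // an IndexInput would be read into it here
    return size;
  }

  static MemorySegment makeReaderStub(Arena arena) throws ReflectiveOperationException {
    FunctionDescriptor descriptor =
        FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS, ValueLayout.JAVA_LONG);
    MethodHandle target =
        MethodHandles.lookup()
            .findStatic(ReaderStubSketch.class, "readBytes", descriptor.toMethodType());
    // native code can invoke the returned segment as a function pointer
    return Linker.nativeLinker().upcallStub(target, descriptor, arena);
  }
}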


// Create an index
MemorySegment pointer = temp.allocate(ADDRESS);
callAndHandleError(INDEX_FACTORY, pointer, dimension, temp.allocateFrom(description), metric);
Contributor

There is an index in Faiss called IndexIDMap that can wrap any Faiss index, and it could be used here to avoid having another ordToDoc array. Faiss internally maintains that mapping and will only return the doc IDs you passed during construction. See if we want to use that. FYI, OpenSearch actually uses it. :)

Contributor

There is an index in Faiss called IndexIDMap that can wrap any Faiss index, and it could be used here to avoid having another ordToDoc array. Faiss internally maintains that mapping and will only return the doc IDs you passed during construction. See if we want to use that. FYI, OpenSearch actually uses it. :)

I see that you have already mentioned it in your description: "Currently the ordinal -> doc id mapping is stored in Lucene and looked up at the end (can be done in Faiss using an IDMap, needs some investigation)". Please ignore this comment.

Contributor Author

Yeah, I was trying this offline -- added in the latest commit! Performance looks largely unchanged

Lucene:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.812         1.364  200000   100      50       32        200         no   147.53       1355.65           0.01             1           236.93        228.882       228.882

Faiss:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.811         1.091  200000   100      50       32        200         no   144.92       1380.07           0.00             1           511.97        228.882       228.882

@mikemccand
Member

If I have learned one thing over the years, it's that benchmarking accurately is very difficult!

Amen to that!!

@mikemccand
Member

should report total CPU cycles consumed during indexing and searching (summed across all threads)...

@mikemccand that would help these higher-level multithreaded performance questions a ton. Though, one remaining issue with this benchmark is that FAISS has its own native multi-threading that can throw even more wrenches into the measurement!

Indeed! I hope we can instrument / pull CPU counters correctly so that we account for a private threadpool FAISS is spawning.

Comment on lines 230 to 234
// TODO: This is like a post-filter, include at runtime?
int doc = ordToDoc[ord];
if (acceptDocs == null || acceptDocs.get(doc)) {
  knnCollector.collect(doc, distances[i]);
}
@navneet1v
Contributor

navneet1v commented Jan 30, 2025

Contributor Author

Yeah, I came across this -- an IDSelectorBitmap appears to be the most relevant for us..

Lucene's liveDocs are generally present as a FixedBits (and acceptDocs as a FixedBitSet in case of a filtered search) -- both of which have an underlying long[]

The IDSelectorBitmap expects an array of bytes, so we could dump Lucene's long[] in BIG_ENDIAN byte order to the C process and directly use it as a byte array!

One problem is that the C_API of Faiss does not support creating an IDSelectorBitmap directly, so we may need:

  1. Changes to Faiss to expose this class from the C_API (good-to-have in the long term)
  2. Figure out the C-equivalent struct layout of IDSelectorBitmap and directly allocate it from Panama (minimally invasive)

I'll take a pass at this soon
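A rough sketch of the long[] -> native bytes copy for (2), with the byte order left as a parameter -- which of BIG_ENDIAN / LITTLE_ENDIAN lines bit i of the bitset up with bit (i & 7) of byte (i >> 3) should be verified against IDSelectorBitmap's layout:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import org.apache.lucene.util.FixedBitSet;

class BitmapCopySketch {
  // copy the bitset's backing words into native memory as a byte-addressed bitmap
  static MemorySegment copyBits(FixedBitSet bits, Arena arena, ByteOrder order) {
    long[] words = bits.getBits();
    MemorySegment segment = arena.allocate((long) words.length * Long.BYTES);
    ValueLayout.OfLong layout = ValueLayout.JAVA_LONG_UNALIGNED.withOrder(order);
    for (int i = 0; i < words.length; i++) {
      segment.set(layout, (long) i * Long.BYTES, words[i]);
    }
    return segment;
  }
}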

Contributor Author

I created facebookresearch/faiss#4158 for (1)

- Create an index using `add_with_ids` instead of `add`
- Remove `ordToDoc` from both indexing and search flows
- Some misc changes and refactoring