[Draft] Support Multi-Vector HNSW Search via Flat Vector Storage #14173

Draft: wants to merge 21 commits into base: main

vigyasharma
Contributor

Another take at #12313

This PR adds support for independent multi-vectors, i.e. scenarios where a single document is represented by multiple independent vector values. The most common example is the passage-vector search use case, where we create a vector for every paragraph chunk in the document.

Currently, Lucene only supports a single vector per document. Users are required to create parent-child relationships with the chunked vectors of a document, and call a ParentBlockJoin query to run passage vector search. This change allows indexing multiple vectors within the same document.

Each vector is still assigned a unique int ordinal, but multiple ordinals can now map to the same document. We use additional metadata to maintain the many-to-one ordToDoc mapping, and to quickly find the first indexed vector ordinal for a document (called baseOrdinal, or baseOrd). This gives us new APIs that fetch all vectors for a document, which can be used for faster scoring (as opposed to the child doc query in the ParentJoin approach):

// iterator on vector values for the doc corresponding to provided ord
public Iterator<float[]> allVectorValues(int ord);

// simpler API, returns iterator on all vectors for doc corresponding to given base ord.
public Iterator<float[]> allVectorValues(int baseOrd, int ordCount); 

// ... same APIs for ByteVectorValues
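To make the ordinal bookkeeping concrete, here is a minimal in-memory sketch of the idea. It is not the code in this PR: `FlatMultiVectorStore`, `addDocument`, and the list-backed fields are hypothetical names for illustration, and the only assumption carried over from the description is that a document's vectors receive contiguous ordinals starting at its baseOrd.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; not a class from this PR or from Lucene.
final class FlatMultiVectorStore {
  private final List<float[]> vectors = new ArrayList<>();       // indexed by ordinal
  private final List<Integer> docOfOrd = new ArrayList<>();      // many-to-one: ordinal -> docId
  private final List<Integer> baseOrdOfDoc = new ArrayList<>();  // docId -> first ordinal
  private final List<Integer> ordCountOfDoc = new ArrayList<>(); // docId -> number of vectors

  /** Appends a document's vectors with contiguous ordinals and returns the new docId. */
  int addDocument(List<float[]> docVectors) {
    int docId = baseOrdOfDoc.size();
    baseOrdOfDoc.add(vectors.size());
    ordCountOfDoc.add(docVectors.size());
    for (float[] v : docVectors) {
      vectors.add(v);
      docOfOrd.add(docId);
    }
    return docId;
  }

  /** Document that owns the given ordinal. */
  int ordToDoc(int ord) {
    return docOfOrd.get(ord);
  }

  /** First ordinal of the document that owns the given ordinal. */
  int baseOrd(int ord) {
    return baseOrdOfDoc.get(docOfOrd.get(ord));
  }

  /** Iterator over all vectors of the document that owns the given ordinal. */
  Iterator<float[]> allVectorValues(int ord) {
    int doc = docOfOrd.get(ord);
    return allVectorValues(baseOrdOfDoc.get(doc), ordCountOfDoc.get(doc));
  }

  /** Iterator over ordCount vectors starting at baseOrd. */
  Iterator<float[]> allVectorValues(int baseOrd, int ordCount) {
    return new Iterator<float[]>() {
      private int cursor = baseOrd;
      private final int end = baseOrd + ordCount;

      @Override
      public boolean hasNext() {
        return cursor < end;
      }

      @Override
      public float[] next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return vectors.get(cursor++);
      }
    };
  }
}
```

With two documents holding three and two vectors, ordinals 0 to 2 map to doc 0 and ordinals 3 and 4 map to doc 1, so `allVectorValues(4)` walks the two vectors starting at baseOrd 3.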

Interface

The interface to use multi-vector values is quite simple now:

// indexing
Document doc = new Document();
doc.add(vector1);
doc.add(vector2);
...
doc.add(vectorN);
iw.addDocument(doc);

// query
KnnFloatMultiVectorQuery query = new KnnFloatMultiVectorQuery(field, target, k);
searcher.search(query, k);

I was able to add a multi-vector benchmark to luceneutil to run this setup end to end. Will link results and a luceneutil PR in comments.

Pending Tasks:

This is an early draft to get some feedback; I have TODOs across the code for future improvements. Here are some big items pending:

  • Backward compatibility for the storage format
  • New version for vector storage format (Lucene 111)?
  • Support for merging on multi vector values
  • Optimization for single-valued vectors (store less metadata)
  • Support for scoring based on all vectors of a document (e.g. ScoreMode.Avg)
  • Unit tests
  • Support for multi-valued vectors in quantized vectors.
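As a rough sketch of what scoring over all vectors of a document could look like, the snippet below aggregates per-vector similarities with either a max or an average. Everything in it is an assumption for illustration: `MultiVectorScoring` and its `Mode` enum are hypothetical (only the name ScoreMode.Avg appears in the list above), and dot product is used merely as a stand-in similarity function.

```java
import java.util.Iterator;

// Hypothetical illustration only; not the scoring code in this PR.
final class MultiVectorScoring {
  enum Mode { MAX, AVG }

  /** Dot product, used here as a stand-in similarity (an assumption). */
  static float dot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Aggregates the query's similarity to every vector of one document. */
  static float score(float[] query, Iterator<float[]> docVectors, Mode mode) {
    float max = Float.NEGATIVE_INFINITY;
    float total = 0f;
    int count = 0;
    while (docVectors.hasNext()) {
      float s = dot(query, docVectors.next());
      max = Math.max(max, s);
      total += s;
      count++;
    }
    if (count == 0) {
      throw new IllegalArgumentException("document has no vectors");
    }
    return mode == Mode.MAX ? max : total / count;
  }
}
```

The iterator argument is meant to line up with what an `allVectorValues`-style API returns, so an aggregate score can be computed in one pass over a document's vectors.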

__
Note: This change does not include dependent multi-valued vectors like ColBERT, where the multiple vectors must be used together to compute similarity. It does, however, lay essential groundwork which can subsequently be extended for this support.

@vigyasharma
Contributor Author

Ran some early benchmarks to compare this flat storage based multi-vector approach with the existing parent-join approach. I would appreciate any feedback on the approach, benchmark setup, or any mistakes you spot along the way.

Observations:

  1. Latency and recall are better with multiVectors when both the parentJoin and multiVector benchmarks are run on my branch. However, the parentJoin benchmark has significantly better latency and recall when run on the main branch. Some key differences between the runs on my branch and on main are:
    1. My branch always creates and loads the metadata needed for multiVector, even in the single-vector (parentJoin) case. I went with a simplistic approach here, so my guess is that this is the source of the added latency.
    2. I ran both benchmarks on my branch with merging disabled, because I haven't implemented the merging changes yet.
    3. I've added run results below, but I wouldn't put too much faith in them until we have narrowed down the cause of the latency.
  2. For parentJoin benchmark run on main, there is a visible drop in recall when I disable merges (as compared to a main branch run with merges enabled). Is this expected?

...

Note that nDoc for a parentJoin run equals nVectors plus the nDoc of the corresponding multiVector run (e.g. 10000 + 103 = 10103), because parentJoin creates parent documents in addition to the child vector documents.

ParentJoin v/s MultiVector (on multi-vector branch)

# multivector
recall  latency (ms)  nVectors  nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.673         3.548     10000   103   100      50       32        100         no     1.62         63.58             1            29.40         29.297        29.297
 0.431         4.857    100000  1323   100      50       32        100         no    11.42        115.84             3           294.08        292.969       292.969
 0.461         8.034    200000  2939   100      50       32        100         no    22.62        129.92             6           588.27        585.938       585.938
 0.496        16.040    500000  8773   100      50       32        100         no    53.50        163.98            14          1470.72       1464.844      1464.844


# parentJoin on multi-vector branch 
# (merges disabled, creates and loads multivector config)
recall  latency (ms)  nVectors    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.610         4.644     10000   10103   100      50       32        100         no     1.70       5946.44             1            29.51         29.297        29.297
 0.242         5.189    100000  101323   100      50       32        100         no    11.34       8935.80             3           295.17        292.969       292.969
 0.275         8.988    200000  202939   100      50       32        100         no    22.54       9005.50             6           590.51        585.938       585.938
 0.290        16.605    500000  508773   100      50       32        100         no    52.70       9654.32            14          1476.26       1464.844      1464.844

...

ParentJoin (on main) v/s MultiVector (on multivector branch)

# parentJoin (on main)
recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.958         1.160   10000   100      50       32        100         no     1.49       6706.91           1.85             1            29.67         29.297        29.297
 0.925         2.392  100000   100      50       32        100         no    34.98       2858.86           7.86             1           297.91        292.969       292.969
 0.914         2.972  200000   100      50       32        100         no    63.80       3134.94          43.48             1           596.14        585.938       585.938
 0.904         4.292  500000   100      50       32        100         no   151.49       3300.57         147.08             1          1491.81       1464.844      1464.844

# multivector
recall  latency (ms)  nVectors  nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.673         3.793     10000   103   100      50       32        100         no     1.59         64.78             1            29.40         29.297        29.297
 0.431         4.572    100000  1323   100      50       32        100         no    11.22        117.87             3           294.08        292.969       292.969
 0.461         7.681    200000  2939   100      50       32        100         no    22.38        131.32             6           588.27        585.938       585.938
 0.496        16.292    500000  8773   100      50       32        100         no    54.10        162.17            14          1470.72       1464.844      1464.844

...

ParentJoin with merges v/s ParentJoin with merges disabled (both on main)

# parentJoin (on main)
recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.958         1.160   10000   100      50       32        100         no     1.49       6706.91           1.85             1            29.67         29.297        29.297
 0.925         2.392  100000   100      50       32        100         no    34.98       2858.86           7.86             1           297.91        292.969       292.969
 0.914         2.972  200000   100      50       32        100         no    63.80       3134.94          43.48             1           596.14        585.938       585.938
 0.904         4.292  500000   100      50       32        100         no   151.49       3300.57         147.08             1          1491.81       1464.844      1464.844

# parentJoin on main (merges disabled)
recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  force merge s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.440         1.297   10000   100      50       32        100         no     1.76       5694.76           2.03             1            29.67         29.297        29.297
 0.692         2.596  100000   100      50       32        100         no    11.35       8807.47          29.76             1           297.86        292.969       292.969
 0.530         3.173  200000   100      50       32        100         no    22.03       9077.71          67.91             1           596.24        585.938       585.938
 0.598         4.368  500000   100      50       32        100         no    53.20       9398.50         204.29             1          1493.26       1464.844      1464.844

@benwtrent
Member

For parentJoin benchmark run on main, there is a visible drop in recall when I disable merges (as compared to a main branch run with merges enabled). Is this expected?

I wonder if you are comparing vector IDs correctly?

Looking at the "merges enabled/disabled" numbers, it doesn't make sense to me that one would be better than the other, since both are then force-merged into a single segment.

I also don't understand the recall change between parentJoin on main vs. parentJoin in your branch. That is a significant difference, which suggests there is a bug in the test runner or in the code itself.

These numbers indeed confuse me. Maybe there is a bug in sparse vector index handling?

@benwtrent
Member

I like where this PR is going.

Note: This change does not include dependent multi-valued vectors like ColBERT, where the multiple vectors must be used together to compute similarity. It does, however, lay essential groundwork which can subsequently be extended for this support.

I think this PR is still using globally unique ordinals for vectors? So, ordinals 1, 2, 3 go to document 1 and ordinals 4, 5 go to doc 2? If so, I think we should "bite the bullet" and make vector ordinals long values. I know this makes HNSW graph building 2x as expensive when it comes to memory usage. But it seems like something we should do.

Models like ColPALI (and ColBERT) will index hundreds, or as many as 1k, vectors per document. This will cause the number of vectors per lucene segment to be restricted to 2-20M, much lower than it is now.
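A back-of-the-envelope sketch of that limit, under two assumptions of mine: ordinals are non-negative Java ints (so roughly Integer.MAX_VALUE of them per segment), and the 2-20M figure is read as how many documents fit in a segment once each one consumes hundreds to a thousand ordinals. `OrdinalBudget` is a hypothetical helper, not part of Lucene.

```java
// Hypothetical helper for capacity estimates; not part of Lucene or this PR.
final class OrdinalBudget {
  /** How many documents fit if each one consumes vectorsPerDoc ordinals. */
  static long maxDocsPerSegment(long maxOrdinals, int vectorsPerDoc) {
    if (vectorsPerDoc <= 0) {
      throw new IllegalArgumentException("vectorsPerDoc must be positive");
    }
    return maxOrdinals / vectorsPerDoc;
  }

  public static void main(String[] args) {
    long intOrdinals = Integer.MAX_VALUE; // 2,147,483,647

    // ~21M documents at 100 vectors each, ~2M at 1,000 vectors each.
    System.out.println(maxDocsPerSegment(intOrdinals, 100));  // 21474836
    System.out.println(maxDocsPerSegment(intOrdinals, 1000)); // 2147483

    // A long ordinal takes twice the bytes of an int ordinal, which is
    // where the 2x memory cost for stored ordinals comes from.
    System.out.println(Long.BYTES / Integer.BYTES); // 2
  }
}
```

The same arithmetic with long ordinals removes the per-segment ceiling for practical purposes, at the price of doubling the bytes spent on each stored ordinal.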
