Vector Search Benchmarks

This repo contains datasets for benchmarking vector search performance, to help Superlinked prioritize integration partners.

Overview

We reviewed a number of publicly available datasets and found three core problems. The table below lists each problem and how this dataset addresses it:

Problem with other vector search benchmarks	How this dataset addresses it
Too little metadata of varied types, which makes it hard to test filter performance	17 metadata properties - numeric, categorical, relational
Vectors too small, while SOTA models often output 2k+ or even 4k+ dims	2688 dims
Dataset too small, especially if larger vectors are used	10k, 100k, 1M and 10M item variants, all sampled from the large dataset

Dataset Issues / Notes

In pre-processing we accidentally dropped asin, the primary key of these datasets. To validate recall we will need to add it back in the next version of this dataset. Right now, there is no PK.
The details column contains a lot of redundancy (null values for missing keys); pruning it would reduce the dataset size by 20-30%.
The original dataset also contains images, but since we do not aim to test embedding model inference performance with vector search vendors, image URLs were not included.

Available Datasets

The benchmark_10M.parquet dataset is the one to measure vector search performance against. We have added smaller variants of this dataset (via uniform sampling) to make it easier to test your benchmarking setup.

Dataset	Records	File Size
benchmark_10k	9,000	207 MB
benchmark_100k	98,500	2.3 GB
benchmark_1M	1,044,500	23.4 GB
benchmark_10M	10,564,046	243 GB

To learn more about the datasets, see reports/summary_report.md and reports/benchmark_10k/README.md.
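
As a rough sizing aid, the raw footprint of the vectors alone can be estimated from the record counts above and the 2,688-dimension embeddings. The sketch below assumes 4-byte float32 components, which this README does not state, and the helper name `raw_vector_bytes` is made up for illustration.

```python
# Record counts are copied from the table above.
RECORDS = {
    "benchmark_10k": 9_000,
    "benchmark_100k": 98_500,
    "benchmark_1M": 1_044_500,
    "benchmark_10M": 10_564_046,
}
DIMS = 2_688          # 7 fields x 384 dims
BYTES_PER_VALUE = 4   # ASSUMPTION: float32; the README does not state the dtype


def raw_vector_bytes(records: int, dims: int = DIMS, width: int = BYTES_PER_VALUE) -> int:
    """Bytes needed for the uncompressed vectors alone (no metadata, no index)."""
    return records * dims * width


for name, count in RECORDS.items():
    print(f"{name}: {raw_vector_bytes(count) / 1e9:.1f} GB of raw vectors")
```

Treat the result as a starting point for capacity planning: index structures add to it, while quantization can shrink it.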

Data Access

Datasets are available via HTTPS download:

# Download benchmark datasets
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products/benchmark_10k.parquet
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products/benchmark_100k.parquet
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products/benchmark_1M.parquet
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products/benchmark_10M.parquet
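
If you prefer to script the download, the same URLs can be fetched from Python. This is a minimal sketch using only the standard library; the function names `dataset_url` and `download` are illustrative and not part of this repo.

```python
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products"
VARIANTS = ("benchmark_10k", "benchmark_100k", "benchmark_1M", "benchmark_10M")


def dataset_url(variant: str) -> str:
    """Build the HTTPS URL for one dataset variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{BASE_URL}/{variant}.parquet"


def download(variant: str, dest_dir: str = ".") -> Path:
    """Stream one variant to disk in chunks instead of loading it into memory."""
    target = Path(dest_dir) / f"{variant}.parquet"
    with urllib.request.urlopen(dataset_url(variant)) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)
    return target
```

Calling `download("benchmark_10k")` fetches the smallest file (207 MB), which is a sensible first check before pulling the 243 GB variant.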

Dataset Production

Source Data

Origin: Amazon Reviews 2023 dataset
Categories: Books, Automotive, Tools & Home Improvement, All Beauty, Computers

Embeddings

Our goal was to mimic the dimensionality of a SOTA model (e.g. Qwen3-Embedding-4B at 2560 dims). To save resources, we built vectors of similar size by concatenating the outputs of a smaller model applied to individual string fields of each item:

Model: BAAI/bge-small-en-v1.5 (384 dims per field)
Fields Embedded: 7 text fields (title, description, features, main_category, store, categories, details)
Final Embedding: 7 × 384 = 2,688 dimensions per product
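
The per-field concatenation described above can be sketched as follows. This is not the script used to build the dataset: the field order, the handling of missing fields, and the `fake_embed` stand-in are all assumptions made so the example runs without downloading a model.

```python
from typing import Callable, Mapping, Sequence

FIELDS = ("title", "description", "features", "main_category",
          "store", "categories", "details")
DIMS_PER_FIELD = 384

Embedder = Callable[[str], Sequence[float]]


def concat_field_embeddings(item: Mapping[str, object], embed: Embedder) -> list[float]:
    """Embed each text field separately and concatenate in a fixed order."""
    vector: list[float] = []
    for field in FIELDS:
        # ASSUMPTION: a missing or empty field is embedded as an empty string.
        text = str(item.get(field) or "")
        part = embed(text)
        if len(part) != DIMS_PER_FIELD:
            raise ValueError(f"{field}: expected {DIMS_PER_FIELD} dims, got {len(part)}")
        vector.extend(part)
    return vector


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real model; returns a constant 384-dim vector."""
    return [float(len(text))] * DIMS_PER_FIELD


example = concat_field_embeddings({"title": "USB-C hub", "store": "Acme"}, fake_embed)
print(len(example))  # 2688
```

In a real run, `embed` would wrap BAAI/bge-small-en-v1.5, for instance through the sentence-transformers library.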

Running Benchmarks

For benchmark_10M.parquet, produce the following measurements, i.e. fill in the 'TBD' cells:

#	Write workload	Write target	Write observed	Read workload	Read target	Read observed
1	Create Index from scratch	< 2hrs	TBD	-	-	-
2	-	-	-	20 QPS of 0.001% filter selectivity	100ms @ p95	TBD
3	-	-	-	20 QPS of 0.1% filter selectivity	100ms @ p95	TBD
4	-	-	-	20 QPS of 1% filter selectivity	100ms @ p95	TBD
5	-	-	-	20 QPS of 10% filter selectivity	100ms @ p95	TBD
6	20 QPS for single-object updates (incl. embedding)	2s @ p95	TBD	20 QPS of 1% filter selectivity	100ms @ p95	TBD
7	200 QPS for single-object updates (incl. embedding)	2s @ p95	TBD	20 QPS of 1% filter selectivity	100ms @ p95	TBD
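
To fill in an "Observed" cell, the collected per-request latencies have to be reduced to a p95 value. The sketch below uses the nearest-rank definition of a percentile, which is an assumption since this README does not say which definition to use; the sample latencies are made up.

```python
def p95(latencies_ms: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_target(latencies_ms: list[float], target_ms: float) -> bool:
    """True when the observed p95 is at or below the target."""
    return p95(latencies_ms) <= target_ms


samples = [40.0] * 95 + [250.0] * 5  # made-up latencies in milliseconds
print(p95(samples), meets_target(samples, 100.0))  # 40.0 True
```

Use a target of 100 for the read columns and 2000 for the write columns, both in milliseconds.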

Formulate the queries like this:

Vector Similarity: Each query should score by dot product similarity against a vector picked at random from the dataset. If your system caches vector-specific computations, rotate through a large set of random vectors; otherwise you can reuse the same vector.
Filters: To hit the target filter selectivity, use one of the filter predicates below or a similar one.
Result details: Add LIMIT 100 to all queries and retrieve only parent_asin for each record to minimize networking overhead (this applies until we add asin back in; see Dataset Issues / Notes above).
Vector Search Recall: We expect that you can tune your system to reach >90% average recall for the ANN index, and that you run the above tests with that tuning.
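
One way to check the recall expectation is to compare ANN results against a brute-force dot product ranking. The sketch below runs on toy data and identifies records by row position, which is an assumption made because the dataset currently has no primary key; the "ANN result" is simulated.

```python
import heapq
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: row ids of the k highest dot-product scores."""
    scored = ((dot(query, vec), idx) for idx, vec in enumerate(vectors))
    return [idx for _, idx in heapq.nlargest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the ANN result also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(500)]  # toy data
query = random.choice(vectors)
truth = exact_top_k(query, vectors, k=100)
approx = truth[:93] + [-1] * 7  # pretend the ANN index missed 7 of the 100 hits
print(recall_at_k(approx, truth))  # 0.93
```

Average recall is the mean of this value over many queries. For filtered queries, compute the ground truth over the rows that pass the filter.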

Selectivity	Predicate
0.001%	`average_rating <= 3.0 and rating_number > 130 and main_category == 'Computers'`
0.1%	`average_rating <= 3.5 and rating_number > 15 and main_category == 'Computers'`
1%	`average_rating >= 3.5 and rating_number > 10 and main_category == 'Computers'`
10%	`main_category in ['Computers', 'All Beauty', 'Buy a Kindle']`
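
If you adapt the predicates to your own query language, it is worth confirming the selectivity they actually produce. A minimal sketch, assuming rows are available as dictionaries with no null values in the filtered columns; the sample rows are made up and do not come from the dataset.

```python
def pred_1_percent(row):
    """Python form of the 1% predicate from the table above."""
    return (row["average_rating"] >= 3.5
            and row["rating_number"] > 10
            and row["main_category"] == "Computers")


def pred_10_percent(row):
    """Python form of the 10% predicate from the table above."""
    return row["main_category"] in ("Computers", "All Beauty", "Buy a Kindle")


def selectivity(rows, predicate):
    """Share of rows that pass the filter, as a percentage."""
    rows = list(rows)
    return 100.0 * sum(1 for row in rows if predicate(row)) / len(rows)


# Made-up rows; real rows would come from the parquet file.
rows = [
    {"average_rating": 4.5, "rating_number": 120, "main_category": "Computers"},
    {"average_rating": 3.0, "rating_number": 5, "main_category": "Computers"},
    {"average_rating": 4.8, "rating_number": 40, "main_category": "Books"},
    {"average_rating": 4.1, "rating_number": 12, "main_category": "All Beauty"},
]
print(selectivity(rows, pred_1_percent))   # 25.0
print(selectivity(rows, pred_10_percent))  # 75.0
```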

Pricing

To enable us to compare different vendors, we consider the above dataset size + performance to be a "unit" of vector search, for which we would like to know:

What are the parameters of the vendor's cloud instance that can support this "unit"?
What is the price-per-GB-month for this instance, assuming a sustained average workload as described by the targets above?
How does the price scale with (a) 2x the size, (b) 2x the read QPS, (c) 2x the write QPS?

License

This dataset is derived from the Amazon Reviews 2023 dataset. Please refer to the original dataset's license for usage terms.
