This repository contains the hard instances constructed in our paper *Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations*, to appear at NeurIPS 2023.
We provide the three data sets we construct for the DiskANN, NSG, and HNSW algorithms described in our paper (Figures 2 and 4).
Note that some of the implementations require the data dimension to be a multiple of a fixed value.
- Point set: test_diskann_1e6_learn.fbin
- Query point set: test_diskann_1e6_query.fbin
- Groundtruth: test_diskann_1e6_groundtruth.fbin
For point sets, the first 4 bytes store the number of points $n$ as a 32-bit integer, the next 4 bytes store the dimension $d$, and the remaining $4nd$ bytes store the point coordinates as 32-bit floats.
For the groundtruth file, the first 4 bytes store the number of queries, the next 4 bytes store the number of groundtruth neighbors per query, and the remaining bytes store the neighbor IDs as 32-bit integers, one row per query.
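Assuming the standard .fbin layout (a 4-byte point count, a 4-byte dimension, then float32 coordinates), a point set can be loaded in Python roughly as follows; `read_fbin` is an illustrative helper, not part of this repository:

```python
import os
import struct
import tempfile

import numpy as np

def read_fbin(path):
    """Read a .fbin point set: a 4-byte int32 point count n, a 4-byte
    int32 dimension d, then n*d float32 coordinates."""
    with open(path, "rb") as f:
        n, d = struct.unpack("<ii", f.read(8))
        data = np.fromfile(f, dtype=np.float32, count=n * d)
    return data.reshape(n, d)

# Round-trip check on a tiny synthetic file.
pts = np.arange(24, dtype=np.float32).reshape(3, 8)
fd, path = tempfile.mkstemp(suffix=".fbin")
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack("<ii", *pts.shape))
    pts.tofile(f)
assert np.array_equal(read_fbin(path), pts)
os.remove(path)
```

The same layout, with `np.int32` in place of `np.float32` for the ID portion, applies when parsing the groundtruth file.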
- Point set: test_nsg_1e6_learn.fvecs
- Query point set: test_nsg_1e6_query.fvecs
- Groundtruth: test_nsg_1e6_groundtruth.ivecs
Each .fvecs point set stores one vector per record: a 4-byte integer giving the dimension $d$, followed by $d$ 4-byte floating-point coordinates.
The groundtruth file is stored in .ivecs format, which uses the same record layout with 4-byte integers in place of floats: for each query, a 4-byte integer count followed by the IDs of its nearest neighbors.
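Because every record in an .fvecs/.ivecs file shares one dimension, both formats can be parsed with a single numpy reshape. The helpers below are illustrative sketches, not part of this repository:

```python
import os
import tempfile

import numpy as np

def read_fvecs(path):
    """Read an .fvecs file: each record is an int32 dimension d
    followed by d float32 coordinates (all records share one d)."""
    raw = np.fromfile(path, dtype=np.float32)
    d = raw[:1].view(np.int32)[0]          # reinterpret header bytes as int32
    return raw.reshape(-1, d + 1)[:, 1:]   # drop the per-record header

def read_ivecs(path):
    """Read an .ivecs file: same layout, but with int32 payloads
    (used here for groundtruth neighbor IDs)."""
    raw = np.fromfile(path, dtype=np.int32)
    return raw.reshape(-1, raw[0] + 1)[:, 1:]

# Round-trip check on a tiny synthetic .fvecs file.
pts = np.arange(6, dtype=np.float32).reshape(2, 3)
fd, path = tempfile.mkstemp(suffix=".fvecs")
with os.fdopen(fd, "wb") as f:
    for row in pts:
        np.array([3], dtype=np.int32).tofile(f)  # per-record dimension header
        row.tofile(f)
assert np.array_equal(read_fvecs(path), pts)
os.remove(path)
```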
- Point set: test_hnsw_1e6_learn.pickle
- Query point set: test_hnsw_1e6_query.pickle
- Groundtruth: test_hnsw_1e6_groundtruth.pickle
The point set is a numpy array of shape $n \times d$ (with $d = 8$), containing the $d$ coordinates of each of the $n$ points.
The query point set is likewise a numpy array of shape $n \times d$ (with $d = 8$).
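Since these files are plain pickled numpy arrays, loading them is a one-liner; a minimal sketch (the round-trip below uses a synthetic array rather than the repository files):

```python
import os
import pickle
import tempfile

import numpy as np

def load_points(path):
    """Load a pickled numpy array of shape (n, d)."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip check with a synthetic (n, d) array, d = 8 as in the data sets.
pts = np.zeros((4, 8), dtype=np.float32)
fd, path = tempfile.mkstemp(suffix=".pickle")
with os.fdopen(fd, "wb") as f:
    pickle.dump(pts, f)
assert load_points(path).shape == (4, 8)
os.remove(path)
```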
Please cite our paper if you find it useful in your work.
@misc{indyk2023worstcase,
      title={Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations},
      author={Piotr Indyk and Haike Xu},
      year={2023},
      eprint={2310.19126},
      archivePrefix={arXiv},
      primaryClass={cs.DS}
}