This repository contains to code for our paper "A Bi-metric Framework for Fast Similarity Search" (https://arxiv.org/abs/2406.02891).
We propose a new "bi-metric" framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a ground-truth metric that is accurate but expensive to compute, and a proxy metric that is cheaper but less accurate. In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while only using a limited number of calls to both metrics. In this repository, we apply the framework to the text retrieval problem with two dissimilarity functions evaluated by ML models with vastly different computational costs. We observe that for almost all datasets in the 15 MTEB benchmark, our approach achieves a considerably better accuracy-efficiency tradeoff than the alternatives, such as re-ranking.
You need to generate our environment from requirements.txt
pip install -r requirements.txt
Run download.sh
to download the datasets you want to test. Currently, we have included all the 15 MTEB retrieval benchmark datasets in the script.
./download.sh
Run preprocessing_embeddings.py
to produce all the embeddings for the bi-encoders you want to test (both the expensive model and the model used as a distance proxy). Currently, we have included all 15 MTEB retrieval benchmark datasets and all the 4 bi-encoders in the python code as mentioned in our paper. Note that it takes a bit longer to generate embeddings for the mistral
model. Feel free to adjust the datasets and model names used in the python code.
./preprocessing_embeddings.py
Run test.sh
to test different methods on each dataset. Currently, we have chosen bge-micro
as the distance proxy model and mistral
as the expensive model. We have included all 15 datasets in the script.
./test.sh
bi(ours)
is our methodbi(baseline)
is the first retrieve and then rerank baseline.single-metric
is to build and search only using the expensive model.
Note that you need to first run single-metric
to generate the groundtruth nearest neighbors for evaluating the recall rate, as in the current script.
After that, you should be able to find the CSV results in the results
folder. The default script generates the results shown in Figure 1 of our paper.
Please cite the paper as
@misc{xu2024bimetric,
title={A Bi-metric Framework for Fast Similarity Search},
author={Haike Xu and Sandeep Silwal and Piotr Indyk},
year={2024},
eprint={2406.02891},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
And feel free to contact the authors at: {haikexu, silwal, indyk}@mit.edu