[👑 NeurIPS 2022 Outstanding Paper] A Neural Corpus Indexer for Document Retrieval -- NCI (Paper)
NCI is an end-to-end, differentiable sequence-to-sequence document retrieval model that generates the identifiers of relevant documents directly from a query. In our evaluation on the Google NQ and TriviaQA datasets, NCI outperforms all baselines, including other model-based indexers:
Model | Recall@1 | Recall@10 | Recall@100 | MRR@100 |
---|---|---|---|---|
NCI (Ensemble) | 70.46 | 89.35 | 94.75 | 77.82 |
NCI (Large) | 66.23 | 85.27 | 92.49 | 73.37 |
NCI (Base) | 65.86 | 85.20 | 92.42 | 73.12 |
DSI (T5-Base) | 27.40 | 56.60 | -- | -- |
DSI (T5-Large) | 35.60 | 62.60 | -- | -- |
SEAL (Large) | 59.93 | 81.24 | 90.93 | 67.70 |
ANCE (MaxP) | 52.63 | 80.38 | 91.31 | 62.84 |
BM25 + DocT5Query | 35.43 | 61.83 | 76.92 | 44.47 |
For more information, check out our paper: https://arxiv.org/abs/2206.02743
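For readers unfamiliar with the metrics in the table above, here is a minimal sketch of how Recall@k and MRR@k can be computed from ranked lists of document identifiers. It is not the repository's evaluation code. The function names and the `runs`/`qrels` dictionaries are hypothetical, and Recall@k is treated as a per-query hit (1 if any relevant identifier appears in the top k), which matches standard recall when each query has a single relevant document.

```python
from typing import Dict, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any relevant doc id appears in the top k, else 0.0 (assumed hit-style recall)."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked[:k]) else 0.0


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return the reciprocal rank of the first relevant doc id within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    runs: Dict[str, Sequence[str]],   # hypothetical: query id -> ranked doc ids
    qrels: Dict[str, Set[str]],       # hypothetical: query id -> relevant doc ids
    ks: Sequence[int] = (1, 10, 100),
    mrr_k: int = 100,
) -> Dict[str, float]:
    """Average the per-query metrics over all queries in `runs`."""
    n = len(runs)
    scores = {
        f"Recall@{k}": sum(recall_at_k(runs[q], qrels[q], k) for q in runs) / n
        for k in ks
    }
    scores[f"MRR@{mrr_k}"] = sum(mrr_at_k(runs[q], qrels[q], mrr_k) for q in runs) / n
    return scores


if __name__ == "__main__":
    runs = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
    qrels = {"q1": {"d1"}, "q2": {"d9"}}
    print(evaluate(runs, qrels, ks=(1, 2), mrr_k=2))
```

The table reports these values as percentages, so multiply by 100 to compare.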
[1] Install Anaconda.
[2] Clone repository:
git clone https://github.com/solidsea98/Neural-Corpus-Indexer-NCI.git
cd Neural-Corpus-Indexer-NCI
[3] Create conda environment:
conda env create -f environment.yml
conda activate NCI
[4] Docker:
If you prefer Docker, the NCI image is mzmssg/corpus_env:latest.
You can process data with NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.
[1] Dataset Download.
NCI is currently evaluated on the Google NQ and TriviaQA datasets. Please download them before re-training.
[2] Semantic Identifier
NCI uses content-based document identifiers. A pre-trained BERT model generates an embedding for each document, the documents are clustered with hierarchical K-means, and each document is assigned a semantic identifier based on its position in the cluster hierarchy. To ensemble NCI models, you can generate several sets of embeddings and semantic identifiers and train one model on each.
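The identifier assignment described above can be sketched as follows. This is an illustrative, pure-Python version and not the repository's implementation (see the notebooks for that). It assumes the embeddings are already computed, uses plain Lloyd's k-means, and stops splitting once a cluster holds at most `k` documents; the function names, the branching factor `k`, and the stopping rule are all assumptions made for the example.

```python
import random
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors: List[Vector]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors: List[Vector], k: int, rng: random.Random, iters: int = 20) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster label per input vector."""
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: _sq_dist(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = _mean(members)
    return labels


def semantic_ids(embeddings: Dict[str, Vector], k: int, seed: int = 0) -> Dict[str, Tuple[int, ...]]:
    """Assign each document a tuple of cluster indices, one per level of the hierarchy."""
    rng = random.Random(seed)
    result: Dict[str, Tuple[int, ...]] = {}

    def assign(doc_ids: List[str], prefix: Tuple[int, ...]) -> None:
        if len(doc_ids) <= k:
            # Leaf (assumed stopping rule): the last digit is the position inside the cluster.
            for position, doc_id in enumerate(doc_ids):
                result[doc_id] = prefix + (position,)
            return
        labels = kmeans([embeddings[d] for d in doc_ids], k, rng)
        if len(set(labels)) == 1:
            # Degenerate split (e.g. duplicate vectors): fall back to round-robin.
            labels = [i % k for i in range(len(doc_ids))]
        for cluster in range(k):
            members = [d for d, label in zip(doc_ids, labels) if label == cluster]
            if members:
                assign(members, prefix + (cluster,))

    assign(sorted(embeddings), ())
    return result


if __name__ == "__main__":
    # Toy 2-D vectors standing in for BERT document embeddings.
    toy = {f"doc{i}": [float(x), 0.0] for i, x in enumerate([0, 1, 2, 100, 101, 102])}
    for doc_id, identifier in sorted(semantic_ids(toy, k=2).items()):
        print(doc_id, identifier)
```

Documents that share a prefix sit in the same cluster at that level, so nearby documents receive identifiers with a common beginning. Running the function with different seeds or different embeddings gives alternative identifier sets for ensembling.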
[3] Query Generation
In our study, query generation significantly improves retrieval performance, especially for long-tail queries.
NCI uses the docTTTTTquery checkpoint to generate synthetic queries. Please refer to the docTTTTTquery documentation.
Find more details in NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.
Once data pre-processing is complete, you can launch training with train.sh. You can also train on our NQ data (download it to './Data_process/NQ_dataset/') or our TriviaQA data (download it to './Data_process/trivia_dataset/').
To evaluate, run infer.sh with our NQ checkpoint or TriviaQA checkpoint (download it to './NCI_model/logs/'). You can also run inference with your own checkpoint to evaluate model performance.
To ensemble results on the NQ or TriviaQA dataset, use our results (download them to './NCI_model/logs/') or your own.
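This README does not describe how the per-model results are combined, so the following is only a generic illustration of rank-level ensembling, not the repository's method. It uses reciprocal rank fusion (RRF), a common technique swapped in here for the example; the function name, the constant `k=60`, and the input format (one ranked list of document identifiers per model, for a single query) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60, top_n: int = 100) -> List[str]:
    """Fuse several ranked lists for one query; a doc scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


if __name__ == "__main__":
    # Hypothetical outputs of three models trained with different semantic identifiers.
    model_outputs = [
        ["d1", "d2", "d3"],
        ["d2", "d1", "d4"],
        ["d2", "d3", "d1"],
    ]
    print(reciprocal_rank_fusion(model_outputs))
```

Documents ranked highly by several models rise to the top of the fused list, which is the general intuition behind ensembling models trained on different identifier sets.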
If you find this work useful for your research, please cite:
@article{wang2022neural,
title={A Neural Corpus Indexer for Document Retrieval},
author={Wang, Yujing and Hou, Yingyan and Wang, Haonan and Miao, Ziming and Wu, Shibin and Sun, Hao and Chen, Qi and Xia, Yuqing and Chi, Chengmin and Zhao, Guoshuai and others},
journal={arXiv preprint arXiv:2206.02743},
year={2022}
}
We learned a lot and borrowed some code from the following projects when building NCI.