The 10-samples test validates the environment.
Run the ColBERT embedding process on enwiki-20230401-10samples.tsv
Change the root path of three variables in wiki2023-10samples_tsv-2-colbert_embedding.py: checkpoint, index_dbPath, and collection. ColBERT requires absolute paths, so all three must be updated to match your machine.
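As a rough illustration of the path change, the sketch below builds absolute paths from a single root directory. The root location and the three sub-paths are hypothetical placeholders, not the values used in the RAGLAB script; only the variable names checkpoint, index_dbPath, and collection come from the text above.

```python
from pathlib import Path

# Hypothetical checkout location; replace with where RAGLAB lives on your machine.
RAGLAB_ROOT = Path("/home/user/RAGLAB")


def build_absolute_paths(root: Path) -> dict:
    """Return absolute paths for the three variables the embedding script needs."""
    # resolve() makes the root absolute even if a relative path was passed in.
    root = root.expanduser().resolve()
    return {
        # The sub-paths below are placeholders (assumptions), not RAGLAB's real layout.
        "checkpoint": str(root / "path/to/colbert_checkpoint"),
        "index_dbPath": str(root / "path/to/index_output_dir"),
        "collection": str(root / "path/to/enwiki-20230401-10samples.tsv"),
    }


paths = build_absolute_paths(RAGLAB_ROOT)
for name, value in paths.items():
    print(f"{name} = {value!r}")
```

Copying the printed values into the script is one way to avoid accidentally leaving a relative path behind.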
cd RAGLAB
sh run/wiki2023_preprocess/2-wiki2023-10samples_tsv-2-colbert_embedding.sh
The embedding process takes around 15 minutes the first time.
The first run is slow because ColBERT needs to compile its torch_extensions. Later runs that load the processed embeddings are much faster. If no errors occur and the retrieved text is printed, the environment is set up correctly.
Note: RAGLAB already provides enwiki2023 source data on HuggingFace, so there's no need to download it again. This information is just to provide the source of the data.
Download method: install gdown and use it to fetch the file
cd RAGLAB/data/retrieval/colbertv2.0_passages
mkdir wiki2023
pip install gdown
gdown --id 1mekls6OGOKLmt7gYtHs0WGf5oTamTNat
preprocess wiki2023
If the 10-samples test is passed successfully, you can proceed with processing wiki2023.
preprocess .db -> .tsv (Colbert can only read files in .tsv format.)
cd RAGLAB
sh run/wiki2023_preprocess/3-wiki2023_db-2-tsv.sh
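To make the .db -> .tsv step more concrete, here is a minimal sketch of such a conversion. It is not the RAGLAB script. It assumes the .db file is a SQLite database with a table named documents holding title and text columns, and it writes lines of the form id, text, title separated by tabs, with ids counting up from 0. Both the schema and the function name db_to_tsv are assumptions for illustration; check the real script for the actual layout.

```python
import sqlite3


def db_to_tsv(db_path: str, tsv_path: str, table: str = "documents") -> int:
    """Write each row of `table` as 'id<TAB>text<TAB>title'; return the row count."""

    def clean(value) -> str:
        # Collapse all whitespace runs (including tabs and newlines) to single
        # spaces so that every record stays on one line with exactly three fields.
        return " ".join(str(value).split())

    conn = sqlite3.connect(db_path)
    count = 0
    try:
        # Assumed schema: a table with `title` and `text` columns.
        cursor = conn.execute(f"SELECT title, text FROM {table}")
        with open(tsv_path, "w", encoding="utf-8") as out:
            for pid, (title, text) in enumerate(cursor):
                out.write(f"{pid}\t{clean(text)}\t{clean(title)}\n")
                count += 1
    finally:
        conn.close()
    return count
```

A call such as db_to_tsv("wiki.db", "wiki.tsv") (hypothetical file names) would stream the rows one at a time, so the whole database never has to fit in memory.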
.tsv -> embedding
Remember to change the root paths of checkpoint, index_dbPath, and collection, as in the 10-samples test.