This fork was developed for the CSCI 6720 group project at Ontario Tech University. It extends the MINERS framework's ICL classification and Deep Passage Retrieval tasks.
Group Members:
- Alexie Linardatos ([email protected])
- Madhav Goyani ([email protected])
- Zikun Fu ([email protected])
A 3-minute video summary of this project can be found here: Link
The final report can be found here: Link
pip install -r requirements.txt
Microsoft Visual C++ 14.0 or greater is required to run the project. You can download it from Microsoft's website.
Full experiment logs can be accessed here.
❱❱❱ python icl_NER.py --dataset {dataset} --seed 42 --model_checkpoint {model} --gen_model_checkpoint {gen_model_checkpoint} --cuda --load_in_8bit --k {k}
❱❱❱ python icl_NER.py --dataset masakhaner --seed 42 --model_checkpoint sentence-transformers/LaBSE --gen_model_checkpoint meta-llama/Meta-Llama-3.1-8B-Instruct --cuda --load_in_8bit --k 2
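To run the same experiment for several values of `--k`, the command can be assembled in a small Python helper. This is only a sketch: the function name `build_icl_ner_command` and the list of `k` values are hypothetical, and it uses only the flags shown in the commands above.

```python
import shlex


def build_icl_ner_command(dataset, model, gen_model, k, seed=42):
    """Return the icl_NER.py invocation as an argument list.

    Only the flags that appear in the README commands are used here.
    """
    return [
        "python", "icl_NER.py",
        "--dataset", dataset,
        "--seed", str(seed),
        "--model_checkpoint", model,
        "--gen_model_checkpoint", gen_model,
        "--cuda",
        "--load_in_8bit",
        "--k", str(k),
    ]


# Hypothetical sweep over k; adjust the values to your own experiment plan.
for k in (1, 2, 5):
    cmd = build_icl_ner_command(
        "masakhaner",
        "sentence-transformers/LaBSE",
        "meta-llama/Meta-Llama-3.1-8B-Instruct",
        k,
    )
    # Print the command for inspection. To launch it, pass `cmd` to
    # subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the arguments before committing GPU time to a run.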
Download the corpus:
wget https://huggingface.co/datasets/miracl/miracl-corpus/resolve/main/miracl-corpus-v1.0-bn/docs-0.jsonl.gz
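Before chunking, it can help to confirm that the download is intact. The sketch below reads the first few records of the gzipped JSONL file and makes no assumption about the field names inside each record. The helper names `iter_jsonl_gz` and `peek` are hypothetical and are not part of this repository.

```python
import gzip
import json
from itertools import islice


def iter_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def peek(path, n=3):
    """Return the first n records without reading the whole file."""
    return list(islice(iter_jsonl_gz(path), n))


# Example usage, assuming the file was downloaded to the working directory:
# for record in peek("docs-0.jsonl.gz"):
#     print(sorted(record.keys()))
```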
Chunk and encode the downloaded corpus:
python chunk_dataset.py
bash encode_corpus.sh
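This README does not describe the chunking strategy used by `chunk_dataset.py`. For readers unfamiliar with the idea, one common approach is to split each passage into fixed-size word windows with some overlap. The function `chunk_words` and its default sizes below are assumptions for illustration only, not the script's actual behavior.

```python
def chunk_words(text, max_words=100, overlap=20):
    """Split text into windows of at most max_words words.

    Consecutive windows share `overlap` words so that sentences cut at a
    boundary still appear whole in at least one chunk.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Example usage with a tiny window so the overlap is visible:
# chunk_words("a b c d e f g", max_words=4, overlap=1)
```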
Run the evaluation:
bash search_eval.sh
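This README does not state which metrics `search_eval.sh` reports. As a reference point, recall@k and nDCG@k with binary relevance are two metrics commonly used for passage retrieval, and they can be computed as in the sketch below. The function names and the sample document IDs are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary relevance; assumes ranked_ids has no duplicates."""
    relevant = set(relevant_ids)
    # A relevant hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # The ideal ranking places every relevant document first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Example usage with made-up document IDs:
# ranked = ["d3", "d1", "d7"]
# relevant = {"d1", "d9"}
# recall_at_k(ranked, relevant, k=3), ndcg_at_k(ranked, relevant, k=3)
```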
The models, datasets, and framework code used for the experiments are listed below:
- Gemma 2 Instruct google/gemma-2-9b-it
- Llama 3.1 8B Instruct meta-llama/Meta-Llama-3.1-8B-Instruct
- OpenSub bitext mining dataset
- Masakhaner NER dataset
- MIRACL dataset
- Framework code based on the MINERS paper:
@article{winata2024miners,
  title={MINERS: Multilingual Language Models as Semantic Retrievers},
  author={Winata, Genta Indra and Zhang, Ruochen and Adelani, David Ifeoluwa},
  journal={arXiv preprint arXiv:2406.07424},
  year={2024}
}