forked from gentaiscool/miners

MINERS Fork⛏️: OTU CSCI 6720 - Extending on ICL and DPR


ZikunFu/miners

 
 


Extending the MINERS Framework

This fork was developed for Ontario Tech University's CSCI 6720 group project. It extends the MINERS framework's in-context learning (ICL) classification and Dense Passage Retrieval (DPR) tasks.

Group Members:


A 3-minute video summary of this project can be found here: Link

The final report can be found here: Link

🔧 Environment Setup

pip install -r requirements.txt

On Windows, Microsoft Visual C++ 14.0 or greater is required to run the project. You can download it from Microsoft's website.

📝 Experiment Logs

Full experiment logs can be accessed here.

🚀 Running Experiments

ICL Classification

❱❱❱ python icl_NER.py --dataset {dataset} --seed 42 --model_checkpoint {model} --gen_model_checkpoint {gen_model_checkpoint} --cuda --load_in_8bit --k {k}
❱❱❱ python icl_NER.py --dataset masakhaner --seed 42 --model_checkpoint sentence-transformers/LaBSE --gen_model_checkpoint meta-llama/Meta-Llama-3.1-8B-Instruct --cuda --load_in_8bit --k 2
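The command above takes an encoder checkpoint, a generative checkpoint, and a value for `k`. As a rough illustration of how retrieval-based in-context example selection can work, the sketch below picks the `k` training examples closest to a query by cosine similarity and formats them into a prompt. It is not the code in `icl_NER.py`: the function names, the prompt template, and the toy 2-d vectors (standing in for encoder embeddings) are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_examples(query_vec, train_vecs, k):
    """Return indices of the k training vectors most similar to the query."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query_vec, train_vecs[i]),
                    reverse=True)
    return ranked[:k]

def build_prompt(query_text, examples):
    """Format retrieved (text, label) pairs, then the unlabeled query."""
    lines = [f"Text: {text}\nLabel: {label}\n" for text, label in examples]
    lines.append(f"Text: {query_text}\nLabel:")
    return "\n".join(lines)

# Toy data (hypothetical): hand-written vectors stand in for encoder output.
train = [("Lagos is busy", "LOC"), ("Amina sings", "PER"), ("Nairobi is green", "LOC")]
train_vecs = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]]
query_text, query_vec = "Accra is sunny", [1.0, 0.0]

idx = top_k_examples(query_vec, train_vecs, k=2)
prompt = build_prompt(query_text, [train[i] for i in idx])
print(prompt)
```

In a real run the vectors would come from the encoder given by `--model_checkpoint`, and the prompt would be passed to the model given by `--gen_model_checkpoint`.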

Dense Passage Retrieval

Download the corpus (the command below fetches the first shard, docs-0.jsonl.gz, of the Bengali MIRACL corpus):

wget https://huggingface.co/datasets/miracl/miracl-corpus/resolve/main/miracl-corpus-v1.0-bn/docs-0.jsonl.gz

Chunk and encode the downloaded corpus:

python chunk_dataset.py
bash encode_corpus.sh
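For readers who want to inspect the downloaded file before running the scripts, here is a minimal sketch of reading a gzip-compressed JSON Lines corpus and splitting each passage into fixed-size word chunks. It is not the logic of `chunk_dataset.py`, which is not described here. The `docid` and `text` field names, the 100-word chunk size, and the `-chunk<n>` id suffix are assumptions.

```python
import gzip
import json

def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

def chunk_words(text, max_words=100):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def chunk_corpus(path, max_words=100):
    """Yield (chunk_id, chunk_text) pairs for every document in the file.

    Assumes each record has 'docid' and 'text' fields (an assumption about
    the corpus schema); chunk ids use a hypothetical '-chunk<n>' suffix.
    """
    for doc in read_jsonl_gz(path):
        for n, chunk in enumerate(chunk_words(doc["text"], max_words)):
            yield f"{doc['docid']}-chunk{n}", chunk
```

Usage would look like `for chunk_id, text in chunk_corpus("docs-0.jsonl.gz"): ...`, streaming the file so that the whole corpus never has to fit in memory.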

Run the evaluation:

bash search_eval.sh
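The metrics reported by `search_eval.sh` are not listed here. MIRACL results are commonly reported as nDCG@10 and Recall@100, so the sketch below shows how those two metrics can be computed for a single query with binary relevance. The function names and the toy ranking are hypothetical, and this is not necessarily how the script computes its scores.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary relevance: DCG of the ranking over the ideal DCG."""
    relevant = set(relevant_ids)
    # A hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant)
    # The ideal ranking places all relevant documents first.
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# Toy example (hypothetical ids): one of two relevant documents is in the top 3.
ranked = ["d3", "d1", "d7", "d2"]
relevant = ["d1", "d2"]
print(recall_at_k(ranked, relevant, k=3))
print(ndcg_at_k(ranked, relevant, k=3))
```

Averaging these per-query values over all queries gives the corpus-level score.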

💻 Model Support

All models used for the experiments are listed below:

Encoder LMs and APIs

Open-source LMs:

Generative LMs:

📜 Credits

  • OpenSub bitext mining dataset
  • MasakhaNER dataset
  • MIRACL dataset
  • Framework code based on the MINERS paper:
    @article{winata2024miners,
    title={MINERS: Multilingual Language Models as Semantic Retrievers},
    author={Winata, Genta Indra and Zhang, Ruochen and Adelani, David Ifeoluwa},
    journal={arXiv preprint arXiv:2406.07424},
    year={2024}
    }