Extending the MINERS Framework

This fork was developed for Ontario Tech University's CSCI 6720 group project. It extends the MINERS framework's ICL classification and Dense Passage Retrieval tasks.

Group Members:

Alexie Linardatos ([email protected])
Madhav Goyani ([email protected])
Zikun Fu ([email protected])

A 3-minute video summary of this project can be found here: Link

The final report can be found here: Link

🔧 Environment Setup

pip install -r requirements.txt

On Windows, Microsoft Visual C++ 14.0 or greater is required to run the project. You can download it from Microsoft's website.

📝 Experiment Logs

Full experiment logs can be accessed here.

🚀 Running Experiments

ICL Classification

❱❱❱ python icl_NER.py --dataset {dataset} --seed 42 --model_checkpoint {model} --gen_model_checkpoint {gen_model_checkpoint} --cuda --load_in_8bit --k {k}
❱❱❱ python icl_NER.py --dataset masakhaner --seed 42 --model_checkpoint sentence-transformers/LaBSE --gen_model_checkpoint meta-llama/Meta-Llama-3.1-8B-Instruct --cuda --load_in_8bit --k 2
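
To run the same experiment for several values of k, the command can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name build_icl_command and the k values in the loop are hypothetical, and only the flags shown in the commands above are used.

```python
import shlex


def build_icl_command(dataset, model_checkpoint, gen_model_checkpoint, k, seed=42):
    """Return the icl_NER.py invocation as an argument list (hypothetical helper)."""
    return [
        "python", "icl_NER.py",
        "--dataset", dataset,
        "--seed", str(seed),
        "--model_checkpoint", model_checkpoint,
        "--gen_model_checkpoint", gen_model_checkpoint,
        "--cuda",
        "--load_in_8bit",
        "--k", str(k),
    ]


if __name__ == "__main__":
    # The k values below are illustrative only; pick the ones your experiment needs.
    for k in (1, 2, 5):
        cmd = build_icl_command(
            "masakhaner",
            "sentence-transformers/LaBSE",
            "meta-llama/Meta-Llama-3.1-8B-Instruct",
            k,
        )
        # Print a copy-pasteable shell line; pass cmd to subprocess.run to execute it.
        print(shlex.join(cmd))
```

Keeping the command as a list of arguments avoids shell-quoting problems if a dataset or checkpoint name ever contains unusual characters.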

Dense Passage Retrieval

Download the corpus:

wget https://huggingface.co/datasets/miracl/miracl-corpus/resolve/main/miracl-corpus-v1.0-bn/docs-0.jsonl.gz
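
Before chunking, it can help to confirm that the download is intact by reading a few records. The sketch below is an illustration, not part of the repository: the function name iter_passages is hypothetical, and the field names docid, title, and text are an assumption about the MIRACL corpus record layout that you should verify against your copy.

```python
import gzip
import json
from itertools import islice


def iter_passages(path, limit=None):
    """Yield one dict per JSON line of a gzip-compressed JSONL file.

    limit caps the number of lines read; None reads the whole file.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in islice(handle, limit):
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

For example, iterating over iter_passages("docs-0.jsonl.gz", limit=3) and printing passage.get("docid") for each record is a quick way to check the file without loading it fully into memory.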

Chunk and encode the downloaded corpus:

python chunk_dataset.py
bash encode_corpus.sh
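
The logic inside chunk_dataset.py is not reproduced here. As a general illustration of what passage chunking involves, the sketch below splits text into fixed-size word windows with overlap. The function name chunk_words, the window size, and the overlap are all assumptions chosen for the example and may differ from what the script actually does.

```python
def chunk_words(text, size=100, overlap=20):
    """Split text into windows of up to `size` words, sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each window start advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Overlapping windows reduce the chance that a sentence relevant to a query is split across two chunks, at the cost of a larger index to encode.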

Run the evaluation:

bash search_eval.sh

💻 Models Support

All models used for the experiments are listed below:

Encoder LMs and APIs

Open-source LMs:

Generative LMs:

Gemma 2 Instruct google/gemma-2-9b-it
Llama 3.1 8B Instruct meta-llama/Meta-Llama-3.1-8B-Instruct

📜 Credits

OpenSub bitext mining dataset
MasakhaNER dataset
MIRACL dataset

Framework code based on the MINERS paper:

@article{winata2024miners,
title={MINERS: Multilingual Language Models as Semantic Retrievers},
author={Winata, Genta Indra and Zhang, Ruochen and Adelani, David Ifeoluwa},
journal={arXiv preprint arXiv:2406.07424},
year={2024}
}
