⚡ A repository for evaluating multilingual LLMs on various tasks 🚀 ⚡
⚡ SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning 🚀 ⚡
- Dec 2024: Added batched model-as-judge inference for MCQ questions, speeding up the model-as-judge stage by up to 5x.
- Aug 2024: Matching generated text against references is non-trivial, so for multiple-choice questions we changed the evaluation metric to model-as-judge.
- July 2024: We are building SeaEval v2, with mixed prompt templates and more diverse datasets. v1 has moved to the v1 branch.
Installation with pip:
pip install -r requirements.txt
# Host the judge model on port 5000; model-as-judge provides more accurate answer matching for MCQs.
bash host_model_judge_llama_3_70b_instruct.sh
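Before launching an evaluation, you can verify that the judge model is reachable. The snippet below is a minimal sketch that assumes the hosting script exposes an OpenAI-compatible endpoint on `localhost:5000`; the URL and route are assumptions, so check the script if your setup differs.

```python
# Minimal liveness check, assuming an OpenAI-compatible server on port 5000
# (the endpoint path is an assumption -- adjust to your serving setup).
import requests

resp = requests.get("http://localhost:5000/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())  # should list the hosted judge model if the server is up
```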
The example below evaluates the Llama-3-8B-Instruct model on the mmlu dataset.
# The example runs on a single A100 40G GPU.
# This setting uses only 50 samples for evaluation.
MODEL_NAME=Meta-Llama-3-8B-Instruct
GPU=0
BATCH_SIZE=4
EVAL_MODE=zero_shot
OVERWRITE=True
NUMBER_OF_SAMPLES=50
DATASET=mmlu
bash eval.sh $DATASET $MODEL_NAME $BATCH_SIZE $EVAL_MODE $OVERWRITE $NUMBER_OF_SAMPLES $GPU
# The results will look like:
# {
#     "accuracy": 0.507615302109403,
#     "category_acc": {
#         "high_school_european_history": 0.6585365853658537,
#         "business_ethics": 0.6161616161616161,
#         "clinical_knowledge": 0.5,
#         "medical_genetics": 0.5555555555555556,
#         ...
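To inspect results programmatically, you can load the saved JSON and print the per-category accuracy. The file path below is a hypothetical placeholder; point it at wherever eval.sh writes its output in your run.

```python
# Sketch for inspecting a saved result file; the path is hypothetical --
# adjust it to the actual output location of eval.sh.
import json

with open("log/Meta-Llama-3-8B-Instruct/mmlu_result.json") as f:  # hypothetical path
    result = json.load(f)

print(f"Overall accuracy: {result['accuracy']:.3f}")
# Print categories from best to worst accuracy
for category, acc in sorted(result["category_acc"].items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category}: {acc:.3f}")
```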
A full list of dataset names can be found here.
Dataset | Metrics | Status |
---|---|---|
cross_xquad | AC3, Consistency, Accuracy | ✅ |
cross_mmlu | AC3, Consistency, Accuracy | ✅ |
cross_logiqa | AC3, Consistency, Accuracy | ✅ |
sg_eval | Accuracy | ✅ |
cn_eval | Accuracy | ✅ |
us_eval | Accuracy | ✅ |
ph_eval | Accuracy | ✅ |
flores_ind2eng | BLEU | ✅ |
flores_vie2eng | BLEU | ✅ |
flores_zho2eng | BLEU | ✅ |
flores_zsm2eng | BLEU | ✅ |
mmlu | Accuracy | ✅ |
c_eval | Accuracy | ✅ |
cmmlu | Accuracy | ✅ |
zbench | Accuracy | ✅ |
indommlu | Accuracy | ✅ |
ind_emotion | Accuracy | ✅ |
ocnli | Accuracy | ✅ |
c3 | Accuracy | ✅ |
dream | Accuracy | ✅ |
samsum | ROUGE | ✅ |
dialogsum | ROUGE | ✅ |
sst2 | Accuracy | ✅ |
cola | Accuracy | ✅ |
qqp | Accuracy | ✅ |
mnli | Accuracy | ✅ |
qnli | Accuracy | ✅ |
wnli | Accuracy | ✅ |
rte | Accuracy | ✅ |
mrpc | Accuracy | ✅ |
sg_eval_v1_cleaned | Accuracy | ✅ |
Model | Size | Mode | Status |
---|---|---|---|
Meta-Llama-3-8B-Instruct | 8B | 0-Shot | ✅ |
Meta-Llama-3-70B-Instruct | 70B | 0-Shot | ✅ |
Meta-Llama-3.1-8B-Instruct | 8B | 0-Shot | ✅ |
Qwen2-7B-Instruct | 7B | 0-Shot | ✅ |
Qwen2-72B-Instruct | 72B | 0-Shot | ✅ |
Meta-Llama-3-8B | 8B | 5-Shot | ✅ |
Meta-Llama-3-70B | 70B | 5-Shot | ✅ |
Meta-Llama-3.1-8B | 8B | 5-Shot | ✅ |
To evaluate your own model with SeaEval, add your model to model.py and model_src accordingly, as sketched below.
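As a rough sketch, a new model entry typically wraps loading and generation behind a small class. The interface below is hypothetical (the actual base class and method names in model.py / model_src may differ), so use an existing file in model_src as the real template.

```python
# Hypothetical model wrapper sketch -- mirror the interface of an existing
# entry in model_src rather than this exact signature.
from transformers import AutoModelForCausalLM, AutoTokenizer

class MyCustomModel:
    """Wraps a Hugging Face causal LM so the evaluation loop can query it."""

    def __init__(self, model_path: str, device: str = "cuda"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
        self.device = device

    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        # Tokenize the prompt, generate a continuation, and return only the new text.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)
```

After registering the class, the same eval.sh invocation shown above can be reused with your model's name in place of MODEL_NAME.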
If you find our work useful, please consider citing our paper!
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning
@article{SeaEval,
title={SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning},
author={Wang, Bin and Liu, Zhengyuan and Huang, Xin and Jiao, Fangkai and Ding, Yang and Aw, Ai Ti and Chen, Nancy F.},
journal={NAACL},
year={2024}
}
CRAFT: Extracting and Tuning Cultural Instructions from the Wild
@article{wang2024craft,
title={CRAFT: Extracting and Tuning Cultural Instructions from the Wild},
author={Wang, Bin and Lin, Geyu and Liu, Zhengyuan and Wei, Chengwei and Chen, Nancy F},
journal={ACL 2024 - C3NLP Workshop},
year={2024}
}
CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment
@article{lin2024crossin,
title={CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment},
author={Lin, Geyu and Wang, Bin and Liu, Zhengyuan and Chen, Nancy F},
journal={arXiv preprint arXiv:2404.11932},
year={2024}
}
Contact: [email protected]