
BiasEval: A Cross-language Bias Evaluation on Large Language Models

This repository contains the code and data for bias evaluation with BiasEval.

BiasEval assesses 18 diverse LLMs developed in various countries, using four bias benchmark datasets, two linguistic settings (English and Chinese), and two assessment methods: Single-Choice and Open-Question.

💡 See the BiasEval Leaderboard.

Setup and Install Dependencies

Before evaluation, clone our code and install the dependencies with the following commands:

git clone https://github.com/xinyiqin/Bias_eval.git
cd Bias_eval
pip install -r requirements.txt

API keys

If you want to reproduce the evaluation results of a specific model involved in this project, fill in the corresponding API keys in the .global.sh file.
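A minimal sketch of the kind of entries involved (these variable names are hypothetical; use whichever names .global.sh already defines):

export OPENAI_API_KEY="sk-..."   # hypothetical name: key for GPT models
export QWEN_API_KEY="..."        # hypothetical name: key for Qwen models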

Dataset

The evaluation datasets are stored in the ./data folder.

The possible answers for each dataset are stored in the ./expr folder.
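In brief, the expected layout looks like this (dataset names taken from the inference examples below):

data/    # evaluation datasets: crowspairs, stereoset, bbq, cbbq
expr/    # possible answers for each dataset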

Inference

Single-choice inference

  1. To run inference with a specific model on a single dataset and language:

For example, to run gpt-4o on the crowspairs dataset in English with 32 workers:

bash inference.sh gpt-4o crowspairs en 32

  2. To run inference with a specific model on multiple datasets and languages, pass space-separated lists in quotes:

bash inference.sh gpt-4o "crowspairs stereoset bbq cbbq" "en zh" 32
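The general invocation, inferred from the two examples above (the placeholder names are ours, not from the script's documentation):

bash inference.sh <model> "<dataset ...>" "<language ...>" <num_workers>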

Open-Question inference

bash inference_qa.sh gpt-4o cbbq "en zh" 32
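By analogy with inference.sh, the arguments appear to be a model, a single dataset, a quoted list of languages, and a worker count (this ordering is inferred from the example above):

bash inference_qa.sh <model> <dataset> "<language ...>" <num_workers>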

Evaluation

Single-choice evaluation

You can evaluate multiple models on multiple datasets and languages at once.

For example:

bash evaluate.sh "abab-5.5 qwen-7b" "crowspairs stereoset" "en zh"
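The general form, inferred from the example (each quoted argument takes a space-separated list):

bash evaluate.sh "<model ...>" "<dataset ...>" "<language ...>"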

The evaluation results will be stored in the ./eval folder.

Open-Question evaluation

You can evaluate multiple models in multiple languages, and any GPT model can be used as the evaluator.

For example, to use gpt-4o as the evaluator:

bash gpt_evaluate.sh "ernie-3.5" gpt-4o-2024-05-13 "en zh" cbbq 6
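The general form appears to be the following; the final argument is presumably a worker count, by analogy with the inference scripts (an assumption, since the README does not name it):

bash gpt_evaluate.sh "<model ...>" <evaluator-model> "<language ...>" <dataset> <num_workers>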

The evaluation results will be stored in the ./eval folder.

Ranking

This step aggregates all model results stored in the ./eval folder and ranks them. Before running it, confirm that every model in the ./eval folder has completed the previous steps for the datasets and languages you specify.

bash evaluate_all.sh "stereoset bbq crowspairs cbbq cbbq_qa" "en zh"
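The general form, inferred from the example:

bash evaluate_all.sh "<dataset ...>" "<language ...>"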

The ranking Excel file will be stored in the ./eval/leaderboard folder.

Evaluating your own model

Add your model's inference code to the ./inference/generator file or the ./inference/generator_open_source file.
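Once the inference code is in place, the new model should run through the same scripts as the built-in ones; for example (my-model is a hypothetical name under which your model is registered):

bash inference.sh my-model crowspairs en 32
bash evaluate.sh "my-model" "crowspairs" "en"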
