A Python package for analyzing language model benchmark data using Item Response Theory (IRT).
- Load and preprocess language model benchmark results
- Compute score matrices for model evaluation
- Fit classical and Bayesian IRT models (Rasch, 2PL)
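For reference, the 2PL model gives the probability that model $j$ answers item $i$ correctly in terms of the item's difficulty $b_i$, its discrimination $a_i$, and the model's latent ability $\theta_j$; the Rasch model is the special case with all $a_i = 1$:

$$P(y_{ij} = 1) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}}$$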
pip install git+https://github.com/lamalab-org/irt-on-bench.git
# Clone the repository
git clone https://github.com/lamalab-org/irt-on-bench.git
cd irt-on-bench
# Install in development mode
pip install -e .
You can run a full IRT analysis on your benchmark data using the command-line interface:
# Run basic analysis with default parameters
irt-on-bench --data_path=../data/filtered_model_score_dict.pkl --output_path=../results/
# Run with classical IRT methods instead of PyMC
irt-on-bench --data_path=../data/filtered_model_score_dict.pkl --pymc=False
# Run with all options specified
irt-on-bench \
  --data_path=./data/filtered_model_score_dict.pkl \
  --output_path=./results/ \
  --model=2pl \
  --pymc=true \
  --n_samples=2000 \
  --chains=4 \
  --cores=2 \
  --force=true
import pandas as pd
from irt_on_bench.metadata import BinaryQuestionMetadata
from irt_on_bench.models import fit_2pl_pymc
from irt_on_bench.cli import setup_analyzer_from_scores
analyzer, models = setup_analyzer_from_scores("your_data.pkl", metric_name="all_correct")
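# NOTE (assumption): `binary_array` used below is a questions x models matrix of
# 0/1 scores; it is not defined by the lines above. If your workflow does not
# already provide it, one way to build it from the pickled score dict (format
# shown further down) is to stack the per-model Series for a metric:
import pickle

with open("your_data.pkl", "rb") as f:
    score_dict = pickle.load(f)

binary_array = pd.DataFrame(
    {name: scores["all_correct"] for name, scores in score_dict["overall"].items()}
).to_numpy()  # rows = questions, columns = models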
# Create question metadata
for i in range(binary_array.shape[0]):
    analyzer.add_question_metadata(BinaryQuestionMetadata(f"q{i}"))
# Fit IRT model
irt_results = analyzer.fit_irt(model='2pl')
print(irt_results['difficulties'], irt_results['discriminations'], irt_results['abilities'])
# Alternatively, use Bayesian IRT with PyMC
trace = fit_2pl_pymc(binary_array.T)
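To inspect the Bayesian fit, the returned trace can be summarized with ArviZ; this sketch assumes `fit_2pl_pymc` returns a standard PyMC `InferenceData` object (parameter names depend on the model definition):
import arviz as az

summary = az.summary(trace)  # posterior means, standard deviations, HDIs, and R-hat diagnostics
print(summary.head())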
The package provides utilities for analyzing and visualizing IRT results:
from irt_on_bench.analysis import item_fit_statistics, identify_misfitting_items
from irt_on_bench.analyzer import BenchmarkAnalyzer
# Calculate item fit statistics (abilities, difficulties, discriminations as returned by analyzer.fit_irt above)
item_stats = item_fit_statistics(binary_array, abilities, difficulties, discriminations)
# Identify misfitting items
misfitting_items = identify_misfitting_items(item_stats)
# Analyze extreme items
question_ids = [f"q{i}" for i in range(len(difficulties))]
extreme_items = BenchmarkAnalyzer.analyze_extreme_items(
    difficulties,
    discriminations,
    question_ids
)
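As a library-agnostic illustration (not part of the package API), the fitted parameters can also be visualized directly, for example by plotting an item characteristic curve with matplotlib:
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-4, 4, 200)  # ability range to evaluate
i = 0                            # index of the item to plot
# 2PL item characteristic curve: P(correct | theta)
p = 1 / (1 + np.exp(-discriminations[i] * (theta - difficulties[i])))
plt.plot(theta, p)
plt.xlabel("Ability (theta)")
plt.ylabel("P(correct)")
plt.title(f"Item characteristic curve for {question_ids[i]}")
plt.show()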
The package expects model scores in a specific format:
{
    "overall": {
        "model_name_1": {
            metric_name: pandas.Series,
            ...
        },
        "model_name_2": {
            metric_name: pandas.Series,
            ...
        },
        ...
    },
    ...
}
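For illustration, a score dict in this format can be assembled and pickled as follows (the model names and question indices are placeholders; the metric name matches the `all_correct` metric used above):
import pickle
import pandas as pd

score_dict = {
    "overall": {
        "model_name_1": {"all_correct": pd.Series([1, 0, 1], index=["q0", "q1", "q2"])},
        "model_name_2": {"all_correct": pd.Series([0, 1, 1], index=["q0", "q1", "q2"])},
    }
}

with open("your_data.pkl", "wb") as f:
    pickle.dump(score_dict, f)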
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.
If you use this software in your research, please cite:
@software{irt-on-bench,
  author = {Schilling-Wilhelmi, Mara and Alampara, Nawaf and Jablonka, Kevin M.},
  title = {IRT-on-Bench: Item Response Theory for Language Model Benchmarks},
  year = {2025},
  url = {https://github.com/lamalab-org/irt-on-bench}
}