arc_agi
artificial_analysis
bfcl
cocoabench
exgentic
global-mmlu-lite
hal
helm
hfopenllm_v2
hle
livecodebenchpro
llm_stats
mmlu_pro
mt_bench
multi_swe_bench
openeval
rewardbench
sciarena
swe_bench_verified
swe_polybench
terminal_bench_2
vals_ai
__init__.py
README.md
swe_helpers.py

Adapters

One-off adapter scripts that fetch leaderboard data from external sources and convert it to the Every Eval Ever schema. These are run manually, not via the main CLI.

Usage

Each adapter is run with `uv run python -m utils.<name>.adapter`, where `<name>` is the adapter's directory name.
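
As a minimal sketch of the invocation pattern above, the helper below builds the argument list for one adapter without running it. The function name `adapter_command` is hypothetical and not part of the repository; only the `uv run python -m utils.<name>.adapter` shape and the `--help` flag come from this README.

```python
import shlex


def adapter_command(name: str, *extra_args: str) -> list[str]:
    """Build (but do not run) the argv for one adapter module.

    Hypothetical helper: `name` is the adapter package under utils/.
    """
    return ["uv", "run", "python", "-m", f"utils.{name}.adapter", *extra_args]


if __name__ == "__main__":
    # Print a copy-pasteable shell command for the helm adapter's help text.
    print(shlex.join(adapter_command("helm", "--help")))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` later without shell quoting issues.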

Adapters

Adapter	Data Source	Description
`arc_agi`	ARC Prize leaderboard JSON	Converts ARC-AGI leaderboard data and merges canonical model aliases.
`artificial_analysis`	Artificial Analysis LLM API	Converts Artificial Analysis LLM benchmark, pricing, and performance results into `data/artificial-analysis-llms/`.
`vals_ai`	Vals.ai benchmark leaderboards	Scrapes Vals.ai benchmark pages and converts their embedded leaderboard results into `data/vals-ai/`.
`bfcl`	BFCL leaderboard CSV	Converts BFCL leaderboard data with per-metric evaluation names and bounded continuous scores.
`sciarena`	SciArena leaderboard API	Converts SciArena leaderboard results.
`global-mmlu-lite`	Kaggle API	Fetches Global MMLU Lite leaderboard results from Kaggle.
`hfopenllm_v2`	HuggingFace Spaces API	Fetches the Open LLM Leaderboard v2 (4576+ models).
`helm`	HELM leaderboard	Converts HELM leaderboard data. Supports `--leaderboard_name` for Capabilities/Lite/Classic/Instruct/MMLU.
`llm_stats`	LLM Stats API	Converts LLM Stats model, benchmark, and score API data into `data/llm-stats/`.
`mt_bench`	LMSYS / FastChat	Converts MT-Bench GPT-4 single-answer judgments into `data/mt-bench/`. Emits overall, turn-1, and turn-2 means per model.
`openeval`	HuggingFace	Converts OpenEval response scores from `human-centered-eval/OpenEval` into `data/openeval/`; pass `--include-instances` to also write `*_samples.jsonl` sidecars.
`rewardbench`	HuggingFace	Fetches RewardBench v1 (CSV) and RewardBench v2 (JSON) leaderboard data.
`terminal_bench_2`	tbench.ai	Fetches Terminal-Bench 2.0 agentic coding benchmark results.
`hle`	Scale SEAL leaderboard	Converts the Scale SEAL Humanity's Last Exam leaderboard into `data/hle/`. Emits per-model accuracy (with 95% CI) and calibration error.
`mmlu_pro`	TIGER-Lab leaderboard CSV	Converts the MMLU-Pro leaderboard (`TIGER-Lab/mmlu_pro_leaderboard_submission`) into `data/mmlu-pro/`. Emits per-model overall + 14 per-subject accuracies.
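
To illustrate the kind of conversion the table describes, here is a rough sketch of how per-model overall, turn-1, and turn-2 means (as listed for `mt_bench`) could be derived from single-answer judgments. The row fields (`model`, `turn`, `score`) and the function name `turn_means` are assumptions for illustration; the real adapter's input format and output schema may differ.

```python
from collections import defaultdict
from statistics import mean


def turn_means(judgments):
    """Return {model: {"overall": ..., "turn_1": ..., "turn_2": ...}}.

    `judgments` is an iterable of dicts with the assumed keys
    "model", "turn" (1 or 2), and "score".
    """
    # Group scores by model, then by turn number.
    by_model = defaultdict(lambda: defaultdict(list))
    for row in judgments:
        by_model[row["model"]][row["turn"]].append(row["score"])

    summary = {}
    for model, turns in by_model.items():
        all_scores = [s for scores in turns.values() for s in scores]
        summary[model] = {
            # Overall is the mean over all judgments pooled across turns.
            "overall": mean(all_scores),
            # A turn with no judgments is reported as None rather than guessed.
            "turn_1": mean(turns[1]) if turns.get(1) else None,
            "turn_2": mean(turns[2]) if turns.get(2) else None,
        }
    return summary
```

Here the overall mean pools every judgment, so it is weighted by how many judgments each turn has; averaging the two turn means instead would be an equally plausible choice.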

Notes

These are one-off scripts, not integrated into the main CLI.
They require network access to fetch live leaderboard data.
Some adapters (e.g. rewardbench, helm) may take several minutes to complete due to the number of models.
Run `uv run python -m utils.<name>.adapter --help` for adapter-specific options.
The `livecodebenchpro` script is outdated and will be updated later.
Adapter outputs under `data/<source>/` and saved raw payloads are generated artifacts. Prefer temporary output paths for smoke runs unless the change intentionally includes a data refresh.

Vals.ai

Run a live smoke export from the repository root, writing generated output outside the repo:

uv run python -m utils.vals_ai.adapter --output-dir /tmp/eee-vals-ai

To intentionally prepare a data refresh, use `--output-dir data/vals-ai` and validate the result before deciding whether to include the generated files.

For smaller smoke runs, fetch one benchmark:

uv run python -m utils.vals_ai.adapter \
  --benchmark finance_agent \
  --output-dir /tmp/eee-vals-ai-smoke \
  --save-raw-json /tmp/eee-vals-ai-raw.json

Replay a saved normalized payload without hitting the network:

uv run python -m utils.vals_ai.adapter \
  --input-json /tmp/eee-vals-ai-raw.json \
  --output-dir /tmp/eee-vals-ai-replay

Validate generated records with:

uv run python -m every_eval_ever validate /tmp/eee-vals-ai-smoke
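
The smoke, replay, and validate steps above can be strung together in a small driver. This is a sketch only: the function names `vals_ai_smoke_steps` and `run_steps` and the dry-run default are assumptions, while the commands, flags, and `/tmp` paths are copied from this section.

```python
import shlex
import subprocess


def vals_ai_smoke_steps(benchmark="finance_agent", tmp="/tmp"):
    """Return three argv lists: smoke export, offline replay, validate."""
    raw = f"{tmp}/eee-vals-ai-raw.json"
    smoke_dir = f"{tmp}/eee-vals-ai-smoke"
    replay_dir = f"{tmp}/eee-vals-ai-replay"
    adapter = ["uv", "run", "python", "-m", "utils.vals_ai.adapter"]
    return [
        # 1. Fetch one benchmark and save the raw payload for later replay.
        adapter + ["--benchmark", benchmark,
                   "--output-dir", smoke_dir,
                   "--save-raw-json", raw],
        # 2. Replay the saved payload without hitting the network.
        adapter + ["--input-json", raw, "--output-dir", replay_dir],
        # 3. Validate the records generated by the smoke run.
        ["uv", "run", "python", "-m", "every_eval_ever", "validate", smoke_dir],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step


if __name__ == "__main__":
    run_steps(vals_ai_smoke_steps())
```

With the default `dry_run=True` the driver only prints the commands, which keeps it safe to run without network access; passing `dry_run=False` from the repository root would execute them in order.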
