One-off adapter scripts that fetch leaderboard data from external sources and convert it to the Every Eval Ever schema. These are run manually, not via the main CLI.
Each adapter is run with `uv run python -m utils.<name>.adapter`.

| Adapter | Data Source | Description |
|---|---|---|
| `arc_agi` | ARC Prize leaderboard JSON | Converts ARC-AGI leaderboard data and merges canonical model aliases. |
| `artificial_analysis` | Artificial Analysis LLM API | Converts Artificial Analysis LLM benchmark, pricing, and performance results into `data/artificial-analysis-llms/`. |
| `vals_ai` | Vals.ai benchmark leaderboards | Scrapes Vals.ai benchmark pages and converts their embedded leaderboard results into `data/vals-ai/`. |
| `bfcl` | BFCL leaderboard CSV | Converts BFCL leaderboard data with per-metric evaluation names and bounded continuous scores. |
| `sciarena` | SciArena leaderboard API | Converts SciArena leaderboard results. |
| `global-mmlu-lite` | Kaggle API | Fetches Global MMLU Lite leaderboard results from Kaggle. |
| `hfopenllm_v2` | HuggingFace Spaces API | Fetches the Open LLM Leaderboard v2 (4576+ models). |
| `helm` | HELM leaderboard | Converts HELM leaderboard data. Supports `--leaderboard_name` for Capabilities/Lite/Classic/Instruct/MMLU. |
| `llm_stats` | LLM Stats API | Converts LLM Stats model, benchmark, and score API data into `data/llm-stats/`. |
| `mt_bench` | LMSYS / FastChat | Converts MT-Bench GPT-4 single-answer judgments into `data/mt-bench/`. Emits overall, turn-1, and turn-2 means per model. |
| `openeval` | HuggingFace | Converts OpenEval response scores from `human-centered-eval/OpenEval` into `data/openeval/`; pass `--include-instances` to also write `*_samples.jsonl` sidecars. |
| `rewardbench` | HuggingFace | Fetches RewardBench v1 (CSV) and RewardBench v2 (JSON) leaderboard data. |
| `terminal_bench_2` | tbench.ai | Fetches Terminal-Bench 2.0 agentic coding benchmark results. |
| `hle` | Scale SEAL leaderboard | Converts the Scale SEAL Humanity's Last Exam leaderboard into `data/hle/`. Emits per-model accuracy (with 95% CI) and calibration error. |
| `mmlu_pro` | TIGER-Lab leaderboard CSV | Converts the MMLU-Pro leaderboard (`TIGER-Lab/mmlu_pro_leaderboard_submission`) into `data/mmlu-pro/`. Emits per-model overall + 14 per-subject accuracies. |

- These are one-off scripts, not integrated into the main CLI.
- They require network access to fetch live leaderboard data.
- Some adapters (e.g. `rewardbench`, `helm`) may take several minutes to complete due to the number of models.
- Run `uv run python -m utils.<name>.adapter --help` for adapter-specific options.
- The script for `livecodebenchpro` is outdated and will be updated at a later date.
- Outputs written under `data/<source>/` and saved raw payloads are generated artifacts. Prefer temporary output paths for smoke runs unless a data refresh is intentionally part of the change.

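
The preference for temporary output paths can be checked mechanically in a wrapper script. The sketch below is a hypothetical helper, not part of the repository: it refuses an output path that would land under the repository's `data/` directory unless a refresh is explicitly requested. The function name and the `allow_refresh` parameter are assumptions made for illustration.

```python
from pathlib import Path


def check_output_dir(output_dir, repo_root, allow_refresh=False):
    """Return the resolved output path.

    Hypothetical helper (not part of the repo): raises ValueError when the
    path falls under <repo_root>/data and allow_refresh is False.
    """
    path = Path(output_dir)
    # Relative paths are interpreted against the repository root, since the
    # adapters are documented as being run from there.
    if not path.is_absolute():
        path = Path(repo_root) / path
    resolved = path.resolve()
    data_root = (Path(repo_root) / "data").resolve()

    inside_data = resolved == data_root or data_root in resolved.parents
    if inside_data and not allow_refresh:
        raise ValueError(
            f"{resolved} is under {data_root}; pass allow_refresh=True "
            "if a data refresh is intended"
        )
    return resolved
```

With this guard, a smoke run pointed at `/tmp/eee-vals-ai` passes through unchanged, while `data/vals-ai` is accepted only when `allow_refresh=True` is passed deliberately.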
Run a live smoke export from the repository root, writing generated output outside the repo:

```bash
uv run python -m utils.vals_ai.adapter --output-dir /tmp/eee-vals-ai
```

To intentionally prepare a data refresh, use `--output-dir data/vals-ai` and
validate the result before deciding whether to include generated files.

For smaller smoke runs, fetch one benchmark:

```bash
uv run python -m utils.vals_ai.adapter \
  --benchmark finance_agent \
  --output-dir /tmp/eee-vals-ai-smoke \
  --save-raw-json /tmp/eee-vals-ai-raw.json
```

Replay a saved normalized payload without hitting the network:

```bash
uv run python -m utils.vals_ai.adapter \
  --input-json /tmp/eee-vals-ai-raw.json \
  --output-dir /tmp/eee-vals-ai-replay
```

Validate generated records with:

```bash
uv run python -m every_eval_ever validate /tmp/eee-vals-ai-smoke
```
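
The smoke, replay, and validate commands above can also be chained from a small script. The following is a minimal sketch, assuming the flags behave as described in this document. The function names `build_commands` and `run_all` and the file layout inside the working directory are hypothetical and not part of the repository. Validating the replay directory as well as the smoke directory is likewise an assumption about what a reasonable check looks like.

```python
import subprocess
from pathlib import Path

# Prefix shared by every command shown in this document.
UV_PYTHON = ["uv", "run", "python", "-m"]


def build_commands(workdir, benchmark="finance_agent"):
    """Build smoke, replay, and validate commands for the Vals.ai adapter.

    Hypothetical helper: only flags that appear in this README are used.
    """
    workdir = Path(workdir)
    smoke_dir = workdir / "smoke"    # output of the live single-benchmark run
    replay_dir = workdir / "replay"  # output of the offline replay
    raw_json = workdir / "raw.json"  # saved normalized payload

    adapter = UV_PYTHON + ["utils.vals_ai.adapter"]
    validate = UV_PYTHON + ["every_eval_ever", "validate"]
    return [
        adapter + ["--benchmark", benchmark,
                   "--output-dir", str(smoke_dir),
                   "--save-raw-json", str(raw_json)],
        adapter + ["--input-json", str(raw_json),
                   "--output-dir", str(replay_dir)],
        validate + [str(smoke_dir)],
        validate + [str(replay_dir)],  # assumption: replay output is validated too
    ]


def run_all(commands, repo_root="."):
    """Run each command from the repository root, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=repo_root, check=True)
```

Calling `run_all(build_commands(tmp))` with `tmp` taken from `tempfile.TemporaryDirectory()` keeps every generated file outside the repository. The first command still needs network access; the replay and validate steps do not fetch anything.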