Add AIR-Bench leaderboard to HELM adapter by yifanmai · Pull Request #108 · evaleval/every_eval_ever

yifanmai · 2026-04-15T19:47:19Z

No description provided.

tommasocerruti · 2026-04-17T16:56:03Z

The current convert() logic is not compatible with AIR-Bench’s header format.

This PR adds HELM_AIR_Bench and routes it through the existing generic parsing path. For AIR-Bench, however, the live HELM file selected by the adapter is safety_scenarios.json, and its headers all look like:

AIRBench 2024 - Refusal Rate
AIRBench 2024 - Security Risks
AIRBench 2024 - Operational Misuses
...

The problem is that the current keying logic effectively does this:

short_name = full_eval_name.split()[0] if '-' in full_eval_name else full_eval_name

That is, because the header contains a -, it returns the first whitespace-delimited token, which for AIR-Bench is always AIRBench.
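To make the collapse concrete, here is a minimal sketch that applies the expression quoted above to three of the real headers. The wrapper function `short_name` is only an illustrative name for this example, not necessarily how the adapter structures the code.

```python
# Headers copied from the safety_scenarios.json listing in this comment.
headers = [
    "AIRBench 2024 - Refusal Rate",
    "AIRBench 2024 - Security Risks",
    "AIRBench 2024 - Operational Misuses",
]


def short_name(full_eval_name: str) -> str:
    # Same expression as the keying logic quoted above; the function
    # wrapper itself is hypothetical.
    return full_eval_name.split()[0] if "-" in full_eval_name else full_eval_name


keys = [short_name(h) for h in headers]
print(keys)            # ['AIRBench', 'AIRBench', 'AIRBench']
print(len(set(keys)))  # 1
```

Every header yields the same key, so nothing downstream can tell the columns apart.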

As a result:

- Only the first AIR-Bench column creates an EvaluationResult.
- The remaining 379 AIR-Bench scores are folded into score_details.details.
- AIR-Bench is therefore stored as one giant aggregated result per model instead of many distinct results.
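The folding described above can be reproduced with a small stand-in. This is a simplified sketch, not the adapter's actual convert() code: `group_by_short_name`, the plain-dict result shape, and the scores are all hypothetical and only illustrate how a shared key leads to a single result with the rest pushed into details.

```python
def group_by_short_name(columns):
    """Hypothetical simplification of key-based grouping.

    The first column seen for a key creates a result; later columns with
    the same key are folded into that result's details.
    """
    results = {}
    for header, score in columns:
        key = header.split()[0] if "-" in header else header
        if key not in results:
            results[key] = {"score": score, "details": {}}
        else:
            results[key]["details"][header] = score
    return list(results.values())


# Placeholder scores, made up for illustration only.
columns = [
    ("AIRBench 2024 - Refusal Rate", 0.5),
    ("AIRBench 2024 - Security Risks", 0.25),
    ("AIRBench 2024 - Operational Misuses", 0.75),
]

results = group_by_short_name(columns)
print(len(results))                # 1
print(len(results[0]["details"]))  # 2
```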

So while this PR correctly wires AIR-Bench into the adapter entrypoints, it does not parse AIR-Bench correctly.

I checked the live HELM AIR-Bench source from my shell, and the actual headers are the following:

curl -sS https://storage.googleapis.com/crfm-helm-public/air-bench/benchmark_output/releases/v1.19.0/groups/safety_scenarios.json \
| jq -r '.[].header[1:][]?.value' \
| sed -n '1,20p'

AIRBench 2024 - Refusal Rate
AIRBench 2024 - Security Risks
AIRBench 2024 - Operational Misuses
AIRBench 2024 - Violence & Extremism
AIRBench 2024 - Hate/Toxicity
AIRBench 2024 - Sexual Content
AIRBench 2024 - Child Harm
AIRBench 2024 - Self-harm
AIRBench 2024 - Political Usage
AIRBench 2024 - Economic Harm
AIRBench 2024 - Deception
AIRBench 2024 - Manipulation
AIRBench 2024 - Defamation
AIRBench 2024 - Fundamental Rights
AIRBench 2024 - Discrimination/Bias
AIRBench 2024 - Privacy
AIRBench 2024 - Criminal Activities
AIRBench 2024 - #1: Confidentiality
AIRBench 2024 - #2: Integrity
AIRBench 2024 - #3: Availability

The current keying logic collapses all 380 of those headers to the same key, AIRBench:

curl -sS https://storage.googleapis.com/crfm-helm-public/air-bench/benchmark_output/releases/v1.19.0/groups/safety_scenarios.json \
| jq -r '.[].header[1:][]?.value' \
| awk '{print $1}' \
| sort | uniq -c

 380 AIRBench

The data uploaded in your HF PR appears to confirm exactly this behavior:

HF PR: evaleval/EEE_datastore discussion #70
Sample raw file from that PR: example JSON

That file contains:

- only 1 entry in evaluation_results
- evaluation_name = "AIRBench 2024"
- hundreds of AIR-Bench category scores stored under score_details.details

(Nit: there is also a smaller metadata inconsistency. The PR title says v1.16.0, but the sampled JSON points to v1.19.0 in source_data.url.)

This needs to be fixed here before merge. Codex suggested these as potential action items:

- Handle AIR-Bench separately in convert()
  - AIR-Bench should not use the current short_name = full_eval_name.split()[0] logic.
  - The per-result key should come from something unique per column, likely the part after ' - '.
- Emit separate evaluation_results for AIR-Bench metrics
  - Refusal Rate, Security Risks, Operational Misuses, etc. should become distinct results, not entries inside one details blob.
- Add a regression test
  - A minimal AIR-Bench fixture with two headers would be enough to catch this:
    - AIRBench 2024 - Refusal Rate
    - AIRBench 2024 - Security Risks
  - The test should assert that two EvaluationResults are produced.
- Regenerate the HF AIR-Bench data after the adapter fix
  - The currently uploaded JSONs in HF PR #70 already reflect the collapsed structure, so they would need to be regenerated after fixing the adapter.
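As a rough sketch of the first three action items: key each column by the text after ' - ' and assert that two headers give two results. The names `airbench_metric_name` and `convert_airbench_columns`, the dict-based result shape, and the scores are hypothetical stand-ins, since the real convert() and EvaluationResult types are not shown in this thread.

```python
def airbench_metric_name(full_eval_name: str) -> str:
    """Return the text after the first ' - ', or the whole header if absent."""
    _, sep, rest = full_eval_name.partition(" - ")
    return rest.strip() if sep else full_eval_name


def convert_airbench_columns(columns):
    """Hypothetical stand-in for an AIR-Bench branch: one result per column."""
    return [
        {"evaluation_name": airbench_metric_name(header), "score": score}
        for header, score in columns
    ]


def test_airbench_headers_produce_distinct_results():
    # Two-header fixture, as suggested above; scores are placeholders.
    columns = [
        ("AIRBench 2024 - Refusal Rate", 0.5),
        ("AIRBench 2024 - Security Risks", 0.25),
    ]
    results = convert_airbench_columns(columns)
    assert len(results) == 2
    names = [r["evaluation_name"] for r in results]
    assert names == ["Refusal Rate", "Security Risks"]


test_airbench_headers_produce_distinct_results()
```

Splitting on ' - ' with surrounding spaces, rather than on a bare hyphen, keeps category names such as Self-harm intact.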

Add AIR-Bench leaderboard to HELM adapter

bfff762

Add a separate evaluation result for each category

0ae64ec

yifanmai wants to merge 2 commits into main from yifanmai/helm-air-bench
