Add AIR-Bench leaderboard to HELM adapter#108

Open
yifanmai wants to merge 2 commits into main from yifanmai/helm-air-bench

Conversation

@yifanmai
Collaborator

No description provided.

@tommasocerruti
Member

The current convert() logic is not compatible with AIR-Bench’s header format.

This PR adds HELM_AIR_Bench and routes it through the existing generic parsing path. For AIR-Bench, however, the live HELM file selected by the adapter is safety_scenarios.json, and its headers all look like:

AIRBench 2024 - Refusal Rate
AIRBench 2024 - Security Risks
AIRBench 2024 - Operational Misuses
...

The problem is that the current keying logic effectively does this:

short_name = full_eval_name.split()[0] if '-' in full_eval_name else full_eval_name

That is, because the header contains -, it returns the first whitespace-delimited token, which for AIR-Bench is always AIRBench.

As a result:

  • only the first AIR-Bench column creates an EvaluationResult
  • the remaining 379 AIR-Bench scores are folded into score_details.details
  • AIR-Bench is therefore stored as one giant aggregated result per model instead of many distinct results
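The collapse described above can be reproduced in isolation. This is a minimal sketch, assuming the keying expression quoted earlier is the only thing deciding the key; the `short_name` function wrapper and the `grouped` dictionary are illustrative names, not the adapter's actual code.

```python
from collections import defaultdict


def short_name(full_eval_name: str) -> str:
    # Same expression as the keying logic quoted in this comment.
    return full_eval_name.split()[0] if '-' in full_eval_name else full_eval_name


# Three of the live AIR-Bench headers listed in this comment.
headers = [
    "AIRBench 2024 - Refusal Rate",
    "AIRBench 2024 - Security Risks",
    "AIRBench 2024 - Operational Misuses",
]

# Group headers by the key they would be stored under (illustrative only).
grouped = defaultdict(list)
for header in headers:
    grouped[short_name(header)].append(header)

# Every header lands under the single key "AIRBench".
print(dict(grouped))
```

With every column sharing one key, only the first column can start a new result; the rest have nowhere to go except the details of that result.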

So while this PR correctly wires AIR-Bench into the adapter entrypoints, it does not parse AIR-Bench correctly.

I checked the live HELM AIR-Bench source from my shell, and the actual headers are the following:

curl -sS https://storage.googleapis.com/crfm-helm-public/air-bench/benchmark_output/releases/v1.19.0/groups/safety_scenarios.json \
| jq -r '.[].header[1:][]?.value' \
| sed -n '1,20p'

AIRBench 2024 - Refusal Rate
AIRBench 2024 - Security Risks
AIRBench 2024 - Operational Misuses
AIRBench 2024 - Violence & Extremism
AIRBench 2024 - Hate/Toxicity
AIRBench 2024 - Sexual Content
AIRBench 2024 - Child Harm
AIRBench 2024 - Self-harm
AIRBench 2024 - Political Usage
AIRBench 2024 - Economic Harm
AIRBench 2024 - Deception
AIRBench 2024 - Manipulation
AIRBench 2024 - Defamation
AIRBench 2024 - Fundamental Rights
AIRBench 2024 - Discrimination/Bias
AIRBench 2024 - Privacy
AIRBench 2024 - Criminal Activities
AIRBench 2024 - #1: Confidentiality
AIRBench 2024 - #2: Integrity
AIRBench 2024 - #3: Availability

The current keying logic collapses all 380 of those headers to the same key, AIRBench:

curl -sS https://storage.googleapis.com/crfm-helm-public/air-bench/benchmark_output/releases/v1.19.0/groups/safety_scenarios.json \
| jq -r '.[].header[1:][]?.value' \
| awk '{print $1}' \
| sort | uniq -c

 380 AIRBench

The data uploaded in your HF PR appears to confirm exactly this behavior. The sampled JSON file contains:

  • only 1 entry in evaluation_results
  • evaluation_name = "AIRBench 2024"
  • hundreds of AIR-Bench category scores stored under score_details.details
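A quick way to spot this collapsed shape in a generated file is to count top-level results against nested scores. The sketch below assumes only the field names mentioned in this comment (`evaluation_results`, `evaluation_name`, `score_details.details`) and assumes `details` is a mapping. The `summarize_record` helper and the sample record are hypothetical, and the scores are placeholders rather than real data.

```python
def summarize_record(record: dict) -> dict:
    """Count top-level results and the number of scores nested under each one."""
    results = record.get("evaluation_results") or []
    return {
        "num_results": len(results),
        "names": [r.get("evaluation_name") for r in results],
        "nested_scores": [
            len((r.get("score_details") or {}).get("details") or {})
            for r in results
        ],
    }


# Hypothetical record mimicking the collapsed shape; 0.0 values are placeholders.
collapsed = {
    "evaluation_results": [
        {
            "evaluation_name": "AIRBench 2024",
            "score_details": {
                "details": {
                    "AIRBench 2024 - Security Risks": 0.0,
                    "AIRBench 2024 - Operational Misuses": 0.0,
                }
            },
        }
    ]
}

summary = summarize_record(collapsed)
print(summary)
```

A single result with many nested scores is the signature of the collapse; a correctly parsed file would show many results instead.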

(Nit: there is also a minor metadata inconsistency. The PR title says v1.16.0, but the sampled JSON points to v1.19.0 in source_data.url.)

This needs to be fixed here before merge. Codex suggested these as potential action items:

  1. Handle AIR-Bench separately in convert()

    • AIR-Bench should not use the current short_name = full_eval_name.split()[0] logic.
    • The per-result key should come from something unique per column, likely the part after ' - '.
  2. Emit separate evaluation_results for AIR-Bench metrics

    • Refusal Rate, Security Risks, Operational Misuses, etc. should become distinct results, not entries inside one details blob.
  3. Add a regression test

    • A minimal AIR-Bench fixture with two headers would be enough to catch this:
      • AIRBench 2024 - Refusal Rate
      • AIRBench 2024 - Security Risks
    • The test should assert that two EvaluationResults are produced.
  4. Regenerate the HF AIR-Bench data after the adapter fix

    • The currently uploaded JSONs in HF PR #70 already reflect the collapsed structure, so they would need to be regenerated after fixing the adapter.
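Action items 1 and 3 could look roughly like the sketch below. It is not the adapter's code: `air_bench_key` is a hypothetical helper, and the check exercises only the key derivation because the real signature of convert() is not shown in this thread. It splits on the first ' - ' (with surrounding spaces) rather than on a bare hyphen, so headers such as "AIRBench 2024 - Self-harm" keep their full category name.

```python
def air_bench_key(full_eval_name: str) -> str:
    """Return the text after the first ' - ', or the whole name if there is none."""
    _prefix, sep, rest = full_eval_name.partition(" - ")
    return rest.strip() if sep else full_eval_name


def test_two_headers_yield_two_keys() -> None:
    # Minimal fixture from action item 3: two AIR-Bench headers.
    headers = [
        "AIRBench 2024 - Refusal Rate",
        "AIRBench 2024 - Security Risks",
    ]
    keys = [air_bench_key(h) for h in headers]
    assert keys == ["Refusal Rate", "Security Risks"]
    # Distinct keys are the precondition for distinct EvaluationResults.
    assert len(set(keys)) == len(headers)


test_two_headers_yield_two_keys()
```

A full regression test would go one step further and assert that convert() produces two EvaluationResults for this fixture.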
