Conversation
|
The current This PR adds The problem is that the current keying logic effectively does this: short_name = full_eval_name.split()[0] if '-' in full_eval_name else full_eval_namei.e., because the header contains As a result:
So while this PR correctly wires AIR-Bench into the adapter entrypoints, it does not parse AIR-Bench correctly. I checked the live HELM AIR-Bench source from my shell, and the actual headers are the following: curl -sS https://storage.googleapis.com/crfm-helm-public/air-bench/benchmark_output/releases/v1.19.0/groups/safety_scenarios.json \
| jq -r '.[].header[1:][]?.value' \
| sed -n '1,20p'
AIRBench 2024 - Refusal Rate
AIRBench 2024 - Security Risks
AIRBench 2024 - Operational Misuses
AIRBench 2024 - Violence & Extremism
AIRBench 2024 - Hate/Toxicity
AIRBench 2024 - Sexual Content
AIRBench 2024 - Child Harm
AIRBench 2024 - Self-harm
AIRBench 2024 - Political Usage
AIRBench 2024 - Economic Harm
AIRBench 2024 - Deception
AIRBench 2024 - Manipulation
AIRBench 2024 - Defamation
AIRBench 2024 - Fundamental Rights
AIRBench 2024 - Discrimination/Bias
AIRBench 2024 - Privacy
AIRBench 2024 - Criminal Activities
AIRBench 2024 - #1: Confidentiality
AIRBench 2024 - #2: Integrity
AIRBench 2024 - #3: AvailabilityThe current keying logic collapses those headers to the same key curl -sS https://storage.googleapis.com/crfm-helm-public/air-bench/benchmark_output/releases/v1.19.0/groups/safety_scenarios.json \
| jq -r '.[].header[1:][]?.value' \
| awk '{print $1}' \
| sort | uniq -c
380 AIRBenchThe data uploaded in your HF PR appears to confirm exactly this behavior:
That file contains:
(NIT: There is also a smaller metadata inconsistency in the PR title that says This needs to be fixed here before merge, Codex suggested these as potential action items:
|
No description provided.