This issue is for collecting ideas on what metadata should be part of the BenchmarkCards, beyond what the schema currently covers. Open for everyone to contribute.
From the last meeting, two things came up that are currently missing from the cards:
Capabilities vs risks categorization
The cards currently include risk mappings but don't classify benchmarks at a higher level into capabilities vs risks. This was flagged as important for the EvalCard frontend, where people want to see benchmarks organized by what they actually measure. A few starting points were mentioned in the meeting:
- The Eval Factsheets categorization from the Meta paper
- The IBM capabilities taxonomy already in AI Atlas Nexus
- The clustering approach from the survey paper that groups benchmarks by what they measure
Domain taxonomy
Some benchmarks are domain-specific (medical, legal, code, etc.). HF dataset cards sometimes record the domain, but not consistently. Anna's BenchNavigator dataset might already cover a lot of this.
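To make the two gaps above concrete, here is a rough sketch of how they could look as card fields. Everything in it is hypothetical: the names `MeasureType`, `CardTaxonomy`, `measure_type`, and `domains`, as well as the example benchmark names, are placeholders for discussion and are not part of the current BenchmarkCard schema or of any of the taxonomies mentioned above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class MeasureType(Enum):
    """Hypothetical top-level split: does the benchmark measure a capability or a risk?"""
    CAPABILITY = "capability"
    RISK = "risk"


@dataclass
class CardTaxonomy:
    """Hypothetical extension fields; not part of the current BenchmarkCard schema."""
    benchmark: str
    measure_type: MeasureType
    # Free-text domain labels for now; a controlled vocabulary is still to be decided.
    domains: List[str] = field(default_factory=list)


def group_by_measure_type(cards: List[CardTaxonomy]) -> Dict[MeasureType, List[str]]:
    """Group benchmark names by measure type, e.g. for a frontend listing."""
    groups: Dict[MeasureType, List[str]] = {t: [] for t in MeasureType}
    for card in cards:
        groups[card.measure_type].append(card.benchmark)
    return groups


# Placeholder entries, not real benchmarks.
cards = [
    CardTaxonomy("example-medical-qa", MeasureType.CAPABILITY, ["medical"]),
    CardTaxonomy("example-toxicity", MeasureType.RISK),
]
grouped = group_by_measure_type(cards)
```

A single enum is the simplest option. If a benchmark can measure both a capability and a risk, `measure_type` would need to become a list instead.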
How to contribute
If you have ideas for fields, categories, or taxonomies that should be included, please comment here. Help is welcome on:
- Reviewing what's already in the BenchmarkCard schema and mapping it against what's missing
- Exploring what's feasible to populate automatically from existing sources (Anna's dataset, HF metadata, IBM taxonomies)
- Proposing additional metadata fields that would be useful