| `mmlu_anatomy` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
| `mmlu_pro_law` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
| `mmlu_pro_plus_law` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
| `social_iqa` | Social Interaction Question Answering to evaluate common sense and social reasoning.
| `triviaqa` | A large-scale dataset for trivia question answering to test general knowledge.
| `truthfulqa_mc2` | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses.
| `wikitext` | Tasks based on text from Wikipedia articles to assess language modeling and generation.
| `winogrande` | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge.
| `wmdp_bio` | A benchmark with the objective of minimizing performance, based on potentially-sensitive multiple-choice knowledge questions.
| `wsc273` | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution.
| `xlsum_es` | Collection of tasks in Spanish encompassing various evaluation areas.
| `xnli_tr` | Cross-Lingual Natural Language Inference to test understanding across different languages.
| `xwinograd_zh` | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages.
|===
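The task names in the table above are the identifiers passed to lm-evaluation-harness when an evaluation is launched. As a minimal, illustrative sketch only (the model id `gpt2`, the zero-shot setting, and the batch size are assumptions, not recommendations), two of the listed tasks could be run through the upstream Python API:

[source,python]
----
# Minimal sketch: evaluate a model on two of the task names listed above
# using the lm-evaluation-harness Python API. The model id and batch size
# here are illustrative assumptions only.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                              # Hugging Face transformers backend
    model_args="pretrained=gpt2",            # any Hugging Face model id
    tasks=["winogrande", "truthfulqa_mc2"],  # task names from the table above
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics are keyed by task name in the returned results dictionary.
print(results["results"]["winogrande"])
----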
=== Tier 2 Tasks
These tasks are functional but may lack full CI coverage and comprehensive testing. Community support with fixes will be provided as needed but is limited.footnote:[Tier 2 tasks were selected according to their popularity (above the 70th percentile of downloads, but fewer than 10,000 downloads on HuggingFace).]
[cols="1,2a", options="header"]
|===
|Name |https://github.com/opendatahub-io/lm-evaluation-harness/tree/incubation/lm_eval/tasks[Task Group Description]
| `mathqa` | The MathQA dataset, presented as a multiple-choice task in which the answer choices are not included in the context.
| `logiqa` | Logical reasoning reading comprehension with multi-sentence passages and multiple-choice questions.
| `arabicmmlu_driving_test` | Evaluates all ArabicMMLU tasks.
| `drop` | A QA dataset which tests comprehensive understanding of paragraphs.
| `leaderboard_musr_team_allocation` | A dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative.
| `anli_r2` | A dataset collected via an iterative, adversarial human-and-model-in-the-loop procedure.
| `iwslt2017-en-ar` | Machine translation from English to Arabic (IWSLT 2017).
| `hrm8k_ksm` | Multistep quantitative reasoning problems in Korean standardized math.
| `nq_open` | A Question Answering dataset based on aggregated user queries from Google Search.
| `tinyMMLU` | A small, representative subset of MMLU for quick evaluation across subjects.
| `race` | A dataset collected from English examinations in China, which are designed for middle school and high school students.
| `kobest_wic` | Korean WiC (Word-in-Context) word sense disambiguation from the KoBEST benchmark.
| `mgsm_en_cot_te` | Multilingual Grade School Math (MGSM) word problems; this variant evaluates the Telugu split using English chain-of-thought prompts.
| `kmmlu_hard_law` | Korean MMLU hard split focused on the Law subject category.
| `blimp_passive_2` | BLiMP minimal pairs targeting English grammar acceptability for passive-voice constructions (set 2).
| `pubmedqa` | Biomedical question answering from PubMed abstracts (yes/no/maybe).
| `wmt14-en-fr` | Machine translation from English to French (WMT14).
| `paws_zh` | PAWS-X Chinese paraphrase identification with high lexical overlap pairs.
| `pile_uspto` | The Pile subset containing USPTO patent text for domain-specific language modeling.
| `medqa_4options` | Multiple-choice medical QA (USMLE-style) with four answer options.