
Commit 8643157

feat: Add LMEval Tier 2 tasks (#70)
1 parent edea626 · commit 8643157

File tree

1 file changed: +58 -7 lines changed

docs/modules/ROOT/pages/component-lm-eval.adoc

Lines changed: 58 additions & 7 deletions
@@ -15,7 +15,7 @@ These tasks are fully supported by TrustyAI with guaranteed fixes and maintenanc
 | `bbh` | Tasks focused on deep semantic understanding through hypothesization and reasoning.
 | `bbh_fewshot_snarks` | Tasks focused on deep semantic understanding through hypothesization and reasoning.
 | `belebele_ckb_Arab` | Language understanding tasks in a variety of languages and scripts.
-| `cb` | A suite of challenging tasks designed to test a range of language understanding skills.
+| `cb` | A suite of challenging tasks designed to test a range of language understanding skills.
 | `ceval-valid_law` | Tasks that evaluate language understanding and reasoning in an educational context.
 | `commonsense_qa` | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge.
 | `gpqa_main_n_shot` | Tasks designed for general public question answering and knowledge verification.
@@ -24,12 +24,11 @@ These tasks are fully supported by TrustyAI with guaranteed fixes and maintenanc
 | `humaneval` | Code generation task that measures functional correctness for synthesizing programs from docstrings.
 | `ifeval` | Instruction-following evaluation built from verifiable instructions.
 | `kmmlu_direct_law` | Knowledge-based multi-subject multiple choice questions for academic evaluation.
-| `lambada_openai` | Tasks designed to predict the endings of text passages, testing language prediction skills.
-| `lambada_standard` |
-Tasks designed to predict the endings of text passages, testing language prediction skills.
+| `lambada_openai` | Tasks designed to predict the endings of text passages, testing language prediction skills.
+| `lambada_standard` | Tasks designed to predict the endings of text passages, testing language prediction skills.
 | `leaderboard_math_algebra_hard` | Task group used by Hugging Face's Open LLM Leaderboard v2. These tasks are static and will not change over time.
 | `mbpp` | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions.
-| `minerva_math_precalc` | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills.
+| `minerva_math_precalc` | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills.
 | `mmlu_anatomy` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
 | `mmlu_pro_law` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
 | `mmlu_pro_plus_law` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
@@ -40,11 +39,63 @@ Tasks designed to predict the endings of text passages, testing language predict
 | `social_iqa` | Social Interaction Question Answering to evaluate common sense and social reasoning.
 | `triviaqa` | A large-scale dataset for trivia question answering to test general knowledge.
 | `truthfulqa_mc2` | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses.
-| `wikitext` | wikitext Tasks based on text from Wikipedia articles to assess language modeling and generation.
+| `wikitext` | Tasks based on text from Wikipedia articles to assess language modeling and generation.
 | `winogrande` | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge.
-| `wmdp_bio` | wmdp A benchmark with the objective of minimizing performance, based on potentially-sensitive multiple-choice knowledge questions.
+| `wmdp_bio` | A benchmark with the objective of minimizing performance, based on potentially-sensitive multiple-choice knowledge questions.
 | `wsc273` | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution.
 | `xlsum_es` | Collection of tasks in Spanish encompassing various evaluation areas.
 | `xnli_tr` | Cross-Lingual Natural Language Inference to test understanding across different languages.
 | `xwinograd_zh` | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages.
 |===
+
+=== Tier 2 Tasks
+These tasks are functional but may lack full CI coverage and comprehensive testing. Community support with fixes will be provided as needed, but is limited.footnote:[Tier 2 tasks were selected according to their popularity (above the 70th percentile of downloads, but with fewer than 10,000 downloads on Hugging Face).]
+
+[cols="1,2a", options="header"]
+|===
+|Name |https://github.com/opendatahub-io/lm-evaluation-harness/tree/incubation/lm_eval/tasks[Task Group Description]
+| `mathqa` | The MathQA dataset, presented as a multiple-choice task in which the answer choices are not included in the context.
+| `logiqa` | Logical reasoning reading comprehension with multi-sentence passages and multiple-choice questions.
+| `arabicmmlu_driving_test` | ArabicMMLU multiple-choice questions for the Driving Test subject.
+| `drop` | A QA dataset which tests comprehensive understanding of paragraphs.
+| `leaderboard_musr_team_allocation` | A dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative.
+| `anli_r2` | A dataset collected via an iterative, adversarial human-and-model-in-the-loop procedure.
+| `iwslt2017-en-ar` | Machine translation from English to Arabic (IWSLT 2017).
+| `hrm8k_ksm` | Multistep quantitative reasoning problems in Korean standardized math.
+| `nq_open` | A Question Answering dataset based on aggregated user queries from Google Search.
+| `tinyMMLU` | A small, representative subset of MMLU for quick evaluation across subjects.
+| `race` | A dataset collected from English examinations in China, which are designed for middle school and high school students.
+| `kobest_wic` | Korean WiC (Word-in-Context) word sense disambiguation from the KoBEST benchmark.
+| `mgsm_en_cot_te` | MGSM multilingual grade-school math word problems in Telugu, evaluated with English chain-of-thought prompts.
+| `kmmlu_hard_law` | Korean MMLU hard split focused on the Law subject category.
+| `blimp_passive_2` | BLiMP minimal pairs targeting English grammar acceptability for passive-voice constructions (set 2).
+| `pubmedqa` | Biomedical question answering from PubMed abstracts (yes/no/maybe).
+| `wmt14-en-fr` | Machine translation from English to French (WMT14).
+| `paws_zh` | PAWS-X Chinese paraphrase identification with high lexical overlap pairs.
+| `pile_uspto` | The Pile subset containing USPTO patent text for domain-specific language modeling.
+| `medqa_4options` | Multiple-choice medical QA (USMLE-style) with four answer options.
+| `xquad_tr` | XQuAD Turkish subset for cross-lingual extractive question answering.
+| `qasper_bool` | QASPER boolean-question subset over academic paper content.
+| `wmt16-en-de` | Machine translation from English to German (WMT16).
+| `haerae_history` | Korean history exam-style multiple-choice questions.
+| `cmmlu_arts` | Chinese MMLU category for Arts and Humanities.
+| `agieval_sat_en` | AGIEval SAT English section multiple-choice questions.
+| `flores_eu-ca` | FLORES machine translation from Basque (eu) to Catalan (ca).
+| `pile_10k` | 10k-sample slice of The Pile for faster, lightweight evaluation.
+| `gsm_plus` | Extended GSM-style math word problems emphasizing reasoning robustness.
+| `qa4mre_2013` | QA4MRE 2013 machine reading evaluation: multiple-choice QA over provided documents.
+| `xcopa_tr` | XCOPA Turkish subset for commonsense causal relation choice.
+| `mmlusr_answer_only_anatomy` | MMLU answer-only format for the Anatomy subject (short response).
+| `xstorycloze_en` | XStoryCloze English story ending prediction.
+| `leaderboard_bbh_snarks` | BIG-bench Hard Snarks subset used by the Open LLM Leaderboard.
+| `swag` | SWAG commonsense inference for selecting plausible continuations of a situation.
+| `medmcqa` | Large-scale medical multiple-choice QA covering diverse medical subjects.
+| `realtoxicityprompts` | RealToxicityPrompts for measuring toxicity in generated continuations.
+| `bigbench_gem_generate_until` | BIG-bench GEM task variant evaluating open-ended generation until a stop condition.
+| `tmlu_tour_guide` | Taiwanese Mandarin language understanding: the Tour Guide subject from TMLU, a benchmark comprising 2,981 multiple-choice questions across 37 subjects.
+| `m_mmlu_fr` | Multilingual MMLU French subset across multiple subjects.
+| `tinyHellaswag` | Small, faster subset of HellaSwag for quick evaluation.
+| `leaderboard_ifeval` | A set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times".
+| `coqa` | A large-scale dataset for building Conversational Question Answering systems.
+| `arithmetic_4da` | Four-digit addition problems, from a small battery of 10 tests that pose simple arithmetic problems to language models in natural language.
+|===
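
The names in both tables are the task identifiers that lm-evaluation-harness resolves at run time; in a TrustyAI deployment they are the same identifiers an LMEvalJob's task list refers to. As a quick orientation (not part of this commit), here is a minimal sketch of how a Tier 2 task from this table could be smoke-tested against the harness's Python API directly, assuming lm-evaluation-harness >= 0.4 is installed; the model `EleutherAI/pythia-160m` and the `limit` value are placeholder choices only:

[source,python]
----
# Minimal sketch: smoke-testing one Tier 1 and one Tier 2 task with the
# lm-evaluation-harness Python API. The model is a placeholder -- substitute
# any Hugging Face causal LM you have access to.
import lm_eval
from lm_eval.tasks import TaskManager

# Confirm the identifiers are registered in the installed harness build;
# availability depends on which fork/version (e.g. the opendatahub-io
# incubation branch) is installed.
available = set(TaskManager().all_tasks)
for name in ("winogrande", "logiqa"):
    assert name in available, f"task {name!r} not registered"

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["winogrande", "logiqa"],                  # Tier 1 + Tier 2 names
    num_fewshot=0,
    limit=10,  # evaluate only a few documents, for a quick smoke test
)

# results["results"] maps each task name to its metric dictionary.
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
----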
