LM-Eval Task Support
TrustyAI supports a subset of LMEval tasks to ensure reproducibility and reliability of the evaluation results. Tasks are categorized into three tiers based on our level of support: Tier 1, Tier 2, and Tier 3.
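The task groups listed in the tables below are standard LMEval task identifiers, so a supported task can be exercised directly through the evaluation harness. The following is a minimal sketch, assuming the upstream EleutherAI lm-evaluation-harness Python API (`lm_eval.simple_evaluate`), which LMEval builds on; the task name `gsm8k` (a grade school math benchmark of the kind listed under Tier 1) and the `EleutherAI/pythia-160m` checkpoint are illustrative placeholders, not TrustyAI defaults.

```python
# Minimal sketch: running a supported task with the upstream
# lm-evaluation-harness Python API, which LMEval builds on.
# The task name and model checkpoint are illustrative assumptions;
# substitute any task identifier from the tier tables below.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face model backend
    model_args="pretrained=EleutherAI/pythia-160m",  # example checkpoint
    tasks=["gsm8k"],                                 # e.g. a Tier 1 grade school math task
    num_fewshot=5,                                   # few-shot examples per prompt
    limit=10,                                        # evaluate only a few samples as a smoke test
)

# Per-task metric dictionaries are keyed by task name under "results".
print(results["results"])
```

When running evaluations through TrustyAI's own tooling, these same task identifiers are what you reference; the direct Python call above is simply the quickest way to inspect what a given task group evaluates.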
Tier 1 Tasks
These tasks are fully supported by TrustyAI with guaranteed fixes and maintenance. They have been tested, validated, and monitored in CI for reliability and reproducibility ([1]).
| Name | Task Group Description |
|---|---|
| | Tasks involving complex reasoning over a diverse set of questions. |
| | Tasks focused on deep semantic understanding through hypothesization and reasoning. |
| | Tasks focused on deep semantic understanding through hypothesization and reasoning. |
| | Language understanding tasks in a variety of languages and scripts. |
| | A suite of challenging tasks designed to test a range of language understanding skills. |
| | Tasks that evaluate language understanding and reasoning in an educational context. |
| | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. |
| | Tasks designed for general public question answering and knowledge verification. |
| | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. |
| | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. |
| | Code generation tasks that measure functional correctness for synthesizing programs from docstrings. |
| | Interactive fiction evaluation tasks for narrative understanding and reasoning. |
| | Knowledge-based multi-subject multiple-choice questions for academic evaluation. |
| | Tasks designed to predict the endings of text passages, testing language prediction skills. |
| | Tasks designed to predict the endings of text passages, testing language prediction skills. |
| | Task group used by Hugging Face’s Open LLM Leaderboard v2. These tasks are static and will not change over time. |
| | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions. |
| | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. |
| | Massive Multitask Language Understanding benchmark for broad-domain language evaluation. Several variants are supported. |
| | Massive Multitask Language Understanding benchmark for broad-domain language evaluation. Several variants are supported. |
| | Massive Multitask Language Understanding benchmark for broad-domain language evaluation. Several variants are supported. |
| | Open-book question answering tasks that require external knowledge and reasoning. |
| | Physical Interaction Question Answering tasks to test physical commonsense reasoning. |
| | General Language Understanding Evaluation benchmark to test broad language abilities. |
| | Science Question Answering tasks to assess understanding of scientific concepts. |
| | Social Interaction Question Answering to evaluate common sense and social reasoning. |
| | A large-scale dataset for trivia question answering to test general knowledge. |
| | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. |
| | Tasks based on text from Wikipedia articles to assess language modeling and generation. |
| | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge. |
| | A benchmark where the objective is to minimize performance, based on potentially sensitive multiple-choice knowledge questions. |
| | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. |
| | A collection of tasks in Spanish encompassing various evaluation areas. |
| | Cross-Lingual Natural Language Inference to test understanding across different languages. |
| | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. |
Tier 2 Tasks
These tasks are functional but may lack full CI coverage and comprehensive testing. Community support and fixes will be provided as needed but are limited ([2]).
| Name | Task Group Description |
|---|---|
| | The MathQA dataset, presented as a multiple-choice task where the answer choices are not given in context. |
| | Logical reasoning reading comprehension with multi-sentence passages and multiple-choice questions. |
| | Evaluates all ArabicMMLU tasks. |
| | A QA dataset that tests comprehensive understanding of paragraphs. |
| | A dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. |
| | A dataset collected via an iterative, adversarial human-and-model-in-the-loop procedure. |
| | Machine translation from English to Arabic (IWSLT 2017). |
| | Multistep quantitative reasoning problems in Korean standardized math. |
| | A question answering dataset based on aggregated user queries from Google Search. |
| | A small, representative subset of MMLU for quick evaluation across subjects. |
| | A dataset collected from English examinations in China, which are designed for middle school and high school students. |
| | Korean WiC (Word-in-Context) word sense disambiguation from the KoBEST benchmark. |
| | Multilingual GSM English variant using chain-of-thought teacher signals for math word problems. |
| | Korean MMLU hard split focused on the Law subject category. |
| | BLiMP minimal pairs targeting English grammar acceptability for passive-voice constructions (set 2). |
| | Biomedical question answering from PubMed abstracts (yes/no/maybe). |
| | Machine translation from English to French (WMT14). |
| | PAWS-X Chinese paraphrase identification with high lexical overlap pairs. |
| | The Pile subset containing USPTO patent text for domain-specific language modeling. |
| | Multiple-choice medical QA (USMLE-style) with four answer options. |
| | XQuAD Turkish subset for cross-lingual extractive question answering. |
| | QASPER boolean-question subset over academic paper content. |
| | Machine translation from English to German (WMT16). |
| | Korean history exam-style multiple-choice questions. |
| | Chinese MMLU category for Arts and Humanities. |
| | AGIEval SAT English section multiple-choice questions. |
| | FLORES machine translation from Basque (eu) to Catalan (ca). |
| | 10k-sample slice of The Pile for faster, lightweight evaluation. |
| | Extended GSM-style math word problems emphasizing reasoning robustness. |
| | QA4MRE 2013 machine reading evaluation: multiple-choice QA over provided documents. |
| | XCOPA Turkish subset for commonsense causal relation choice. |
| | MMLU answer-only format for the Anatomy subject (short response). |
| | XStoryCloze English story ending prediction. |
| | BIG-bench Hard Snarks subset used by the Open LLM Leaderboard. |
| | SWAG commonsense inference for selecting plausible continuations of a situation. |
| | Large-scale medical multiple-choice QA covering diverse medical subjects. |
| | RealToxicityPrompts for measuring toxicity in generated continuations. |
| | BIG-bench GEM task variant evaluating open-ended generation until a stop condition. |
| | A dataset that comprises 2,981 multiple-choice questions from 37 subjects. |
| | Multilingual MMLU French subset across multiple subjects. |
| | Small, faster subset of HellaSwag for quick evaluation. |
| | A set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". |
| | A large-scale dataset for building Conversational Question Answering systems. |
| | A small battery of 10 tests that involve asking language models a simple arithmetic problem in natural language. |