MMLU

Creator
Seonglae Cho
Created
2023 Jun 2 12:29
Edited
2025 May 27 16:23
Refs

Massive Multitask Language Understanding

Covers 57 subjects at roughly undergraduate level
  • Human expert accuracy (~89.8% in the original paper) serves as the reference metric
https://hai.stanford.edu/ai-index/2025-ai-index-report
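As a quick orientation, the sketch below loads MMLU from the Hugging Face Hub and prints one item. It assumes the `datasets` library and the public `cais/mmlu` mirror; the field names (`question`, `choices`, `answer`) follow that dataset card.

```python
# Minimal sketch: inspect one MMLU item.
# Assumes the Hugging Face `datasets` library and the public "cais/mmlu" mirror;
# field names (question, choices, answer) follow that dataset card.
from datasets import load_dataset

# The "all" config merges the 57 subject configs into a single split.
mmlu = load_dataset("cais/mmlu", "all", split="test")

item = mmlu[0]
print(item["question"])
for i, choice in enumerate(item["choices"]):
    print(f"  {chr(ord('A') + i)}. {choice}")
print("Correct option index:", item["answer"])  # integer 0-3
```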
 
 
MMLU-Redux corrects errors in MMLU, revealing true LLM capabilities with 3,000 re-annotated questions and an error taxonomy.
While MMLU is a simple multiple-choice evaluation, even minor changes in option formatting can significantly affect performance scores. On the other hand, evaluations like the BBQ Benchmark, Big-Bench, and HELM are noted for their complexity: challenges in implementation, interpretation, and technical intricacies make it difficult to accurately measure model performance.
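To make the formatting-sensitivity point concrete, here is an illustrative sketch (the prompt templates are my own, not from any specific evaluation harness) that renders the same question with lettered versus numbered options; a model scored by matching its predicted label can shift noticeably between such renderings.

```python
# Illustrative only: two renderings of the same multiple-choice question.
# Neither template comes from an official MMLU harness.

def format_prompt(question: str, choices: list[str], style: str = "letters") -> str:
    """Render a multiple-choice prompt in one of two option-label styles."""
    if style == "letters":
        opts = "\n".join(f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices))
    else:  # numbered options: same content, different surface form
        opts = "\n".join(f"({i + 1}) {c}" for i, c in enumerate(choices))
    return f"{question}\n{opts}\nAnswer:"

question = "Which gas makes up most of Earth's atmosphere?"
choices = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]

print(format_prompt(question, choices, "letters"))
print()
print(format_prompt(question, choices, "numbers"))
```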
Challenges in evaluating AI systems (Anthropic)

MMLU Pro

TIGER-Lab/MMLU-Pro · Datasets at Hugging Face
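MMLU-Pro extends MMLU with up to ten answer options per question and more reasoning-heavy items, which lowers the chance-level score and reduces sensitivity to prompt variations. The sketch below loads it from the Hugging Face Hub; it assumes the `datasets` library, and the field names (`question`, `options`, `answer`, `category`) follow the TIGER-Lab/MMLU-Pro dataset card.

```python
# Minimal sketch: inspect one MMLU-Pro item (up to 10 options per question).
# Assumes the `datasets` library; field names follow the TIGER-Lab/MMLU-Pro card.
from datasets import load_dataset

mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

item = mmlu_pro[0]
print(item["category"], "-", item["question"])
for i, option in enumerate(item["options"]):
    print(f"  {chr(ord('A') + i)}. {option}")
print("Gold answer:", item["answer"])  # a letter such as "B"
```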