MMLU

Created
2023 Jun 2 12:29
Creator
Seonglae Cho
Edited
2025 Apr 16 13:31
Refs

Massive Multitask Language Understanding

Multiple-choice benchmark spanning 57 subjects, from elementary to professional difficulty; often summarized as testing undergraduate-level knowledge.
  • Human expert accuracy (estimated at ~89.8% in the original paper) serves as the reference ceiling.
https://hai.stanford.edu/ai-index/2025-ai-index-report
MMLU-Redux corrects errors in MMLU, revealing true LLM capabilities with 3,000 re-annotated questions and an error taxonomy. While MMLU is a simple multiple-choice evaluation, even minor changes in option formatting can significantly affect performance scores. By contrast, evaluations such as the BBQ Benchmark, Big-Bench, and HELM are noted for their complexity: difficulties in implementation, interpretation, and technical intricacy make it hard to measure model performance accurately.
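The formatting sensitivity above is easy to see once the scoring loop is written out: the prompt layout (option letters, separators, the trailing "Answer:") is part of the input the model conditions on. Below is a minimal sketch of MMLU-style multiple-choice scoring; `score_option` is a deterministic stand-in (an assumption for illustration), where a real harness would use the model's log-likelihood of each option letter.

```python
LETTERS = "ABCD"

def format_prompt(question: str, options: list[str]) -> str:
    """Render a question in the common MMLU prompt layout.

    Any change here (e.g. '(A)' vs 'A.') alters the model's input
    and can shift measured accuracy.
    """
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)

def score_option(prompt: str, letter: str) -> int:
    # Stand-in scorer (assumption): a deterministic toy function.
    # Replace with the model's log-likelihood of `letter` given `prompt`.
    return sum(ord(c) for c in letter + prompt) % 7

def predict(question: str, options: list[str]) -> str:
    """Pick the option letter with the highest score."""
    prompt = format_prompt(question, options)
    return max(LETTERS[: len(options)], key=lambda l: score_option(prompt, l))

def accuracy(dataset: list[tuple[str, list[str], str]]) -> float:
    """Fraction of (question, options, gold_letter) items answered correctly."""
    correct = sum(predict(q, opts) == gold for q, opts, gold in dataset)
    return correct / len(dataset)
```

Because the prompt string feeds directly into scoring, two harnesses that differ only in `format_prompt` can report different numbers for the same model, which is one motivation for re-annotated, carefully specified variants like MMLU-Redux.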

Recommendations