Massive Multitask Language Understanding (MMLU)
- Tests undergraduate-level knowledge
- Human expert accuracy serves as a reference metric
MMLU-Redux corrects errors in MMLU by re-annotating 3,000 questions under an explicit error taxonomy, giving a truer picture of LLM capabilities.
Although MMLU is a simple multiple-choice evaluation, even minor changes in option formatting can significantly shift performance scores. In contrast, evaluations such as BBQ, BIG-bench, and HELM are harder to run reliably: their implementation, interpretation, and technical intricacies make it difficult to measure model performance accurately.
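To make the formatting-sensitivity point concrete, here is a minimal sketch (with hypothetical helper names, not from any benchmark's actual codebase) that renders the same MMLU-style item under two option-label conventions. Models scored by strict answer matching can respond differently to each rendering, even though the content is identical.

```python
# Sketch only: hypothetical `format_prompt` helper, illustrating how the
# same multiple-choice item can be rendered with different option labels.

QUESTION = "Which gas makes up most of Earth's atmosphere?"
OPTIONS = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]

def format_prompt(question, options, style="letter_dot"):
    """Render a multiple-choice prompt in one of two label styles."""
    lines = [question]
    for i, opt in enumerate(options):
        label = chr(ord("A") + i)
        if style == "letter_dot":        # "A. Oxygen"
            lines.append(f"{label}. {opt}")
        elif style == "letter_paren":    # "(A) Oxygen"
            lines.append(f"({label}) {opt}")
        else:
            raise ValueError(f"unknown style: {style}")
    lines.append("Answer:")
    return "\n".join(lines)

p1 = format_prompt(QUESTION, OPTIONS, "letter_dot")
p2 = format_prompt(QUESTION, OPTIONS, "letter_paren")
print(p1)
print(p2)
```

The two prompts carry the same question and options, so any score difference between them reflects formatting sensitivity rather than knowledge.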