Massive Multitask Language Understanding (MMLU)
- Tests undergraduate-level knowledge
- Human expert accuracy serves as a reference metric
MMLU-Redux corrects errors in MMLU by re-annotating 3,000 questions under an explicit error taxonomy, giving a truer picture of LLM capabilities.
Although MMLU is a simple multiple-choice evaluation, even minor changes in option formatting can significantly shift performance scores. In contrast, evaluations such as BBQ, BIG-bench, and HELM are harder to run reliably: their implementation, interpretation, and technical intricacies make it difficult to measure model performance accurately.
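To make the formatting-sensitivity point concrete, here is a minimal sketch (with hypothetical helper names, not from any benchmark's actual codebase) that renders the same MMLU-style item under two option-label conventions. Models scored by strict answer matching can respond differently to each rendering, even though the content is identical.

```python
# Sketch only: hypothetical `format_prompt` helper, illustrating how the
# same multiple-choice item can be rendered with different option labels.

QUESTION = "Which gas makes up most of Earth's atmosphere?"
OPTIONS = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]

def format_prompt(question, options, style="letter_dot"):
    """Render a multiple-choice prompt in one of two label styles."""
    lines = [question]
    for i, opt in enumerate(options):
        label = chr(ord("A") + i)
        if style == "letter_dot":        # "A. Oxygen"
            lines.append(f"{label}. {opt}")
        elif style == "letter_paren":    # "(A) Oxygen"
            lines.append(f"({label}) {opt}")
        else:
            raise ValueError(f"unknown style: {style}")
    lines.append("Answer:")
    return "\n".join(lines)

p1 = format_prompt(QUESTION, OPTIONS, "letter_dot")
p2 = format_prompt(QUESTION, OPTIONS, "letter_paren")
print(p1)
print(p2)
```

The two prompts carry the same question and options, so any score difference between them reflects formatting sensitivity rather than knowledge.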