Although MMLU is a simple multiple-choice evaluation, even minor changes in option formatting can significantly shift its scores. By contrast, evaluations such as BBQ, BIG-bench, and HELM are hard to implement and interpret, and their technical intricacies make it difficult to measure model performance accurately.
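To make the formatting sensitivity concrete, here is a minimal sketch that renders one multiple-choice question under several option-label styles. The style names, the `render_prompt` and `render_all` helpers, and the sample question are all hypothetical illustrations, not part of MMLU or any evaluation harness; a real study would send each variant to the model and compare accuracy across them.

```python
from __future__ import annotations

# Hypothetical label styles; the note does not say which variants matter.
LABEL_STYLES = {
    "paren": lambda letter: f"({letter})",
    "dot": lambda letter: f"{letter}.",
    "bracket": lambda letter: f"[{letter}]",
}


def render_prompt(question: str, options: list[str], style: str) -> str:
    """Render a question with lettered options in the given label style."""
    label = LABEL_STYLES[style]
    lines = [question]
    # Pair each option with a letter A, B, C, ... in order.
    for letter, option in zip("ABCDEFGH", options):
        lines.append(f"{label(letter)} {option}")
    lines.append("Answer:")
    return "\n".join(lines)


def render_all(question: str, options: list[str]) -> dict[str, str]:
    """Return one prompt per label style, keyed by style name."""
    return {
        style: render_prompt(question, options, style)
        for style in LABEL_STYLES
    }


if __name__ == "__main__":
    # Illustrative question only; not taken from any benchmark.
    variants = render_all("What is 2 + 2?", ["3", "4", "5", "6"])
    for style, prompt in variants.items():
        print(f"--- {style} ---")
        print(prompt)
```

The content of every variant is identical, so any score difference between them would come from presentation alone.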
HELM
Created: 2025 Apr 10 23:9
Creator: Seonglae Cho
Edited: 2025 Apr 10 23:10
Editor: Seonglae Cho