MMLU is a simple multiple-choice evaluation, yet even minor changes in option formatting can significantly affect its scores. Evaluations such as BBQ Benchmark, Big-Bench, and HELM, by contrast, are noted for their complexity: they are hard to implement and interpret, and their technical intricacies make it difficult to measure model performance accurately.
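To make the formatting sensitivity concrete, here is a minimal sketch of how one multiple-choice item can be rendered in several option styles. The style names, templates, and function names below are hypothetical and not taken from MMLU or any evaluation harness. The idea is that each variant would be scored separately, and the spread in accuracy across variants compared.

```python
from string import ascii_uppercase

# Hypothetical option styles for illustration; real harnesses may use others.
OPTION_STYLES = {
    "paren": "({label}) {text}",
    "dot": "{label}. {text}",
    "colon": "{label}: {text}",
}


def render_prompt(question, choices, style="dot"):
    """Render one multiple-choice item using the given option style."""
    template = OPTION_STYLES[style]
    lines = [question]
    # Label the choices A, B, C, ... in order.
    for label, text in zip(ascii_uppercase, choices):
        lines.append(template.format(label=label, text=text))
    lines.append("Answer:")
    return "\n".join(lines)


def all_variants(question, choices):
    """Return the same item rendered once per option style."""
    return {name: render_prompt(question, choices, name) for name in OPTION_STYLES}


if __name__ == "__main__":
    # Print every rendering of a toy question, separated by a rule.
    for name, prompt in all_variants("What is 2 + 2?", ["3", "4", "5", "6"]).items():
        print(f"--- {name} ---")
        print(prompt)
```

The question and answer content are identical across variants; only the option markers differ, so any score difference between them would come from formatting alone.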
Big-Bench
Creator: Seonglae Cho
Created: 2025 Apr 10 23:8
Editor: Seonglae Cho
Edited: 2025 May 10 12:49