Bias score with dataset BBQ
nyu-mll • Updated 2025 Apr 29 20:50
MMLU is a simple multiple-choice evaluation, yet even minor changes in option formatting can significantly affect the scores a model achieves. By contrast, evaluations such as the BBQ benchmark, BIG-bench, and HELM are noted for their complexity: they are hard to implement and hard to interpret, and their technical intricacies make it difficult to measure model performance accurately.
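To make the formatting point concrete, here is a minimal sketch of how one multiple-choice item can be rendered with different option labels. The style names, the `render_prompt` helper, and the sample question are all hypothetical and are not taken from MMLU or any particular evaluation harness. The sketch only shows that the prompt text a model sees differs between formats; it does not measure any effect on scores.

```python
from string import ascii_uppercase

# Hypothetical label styles (assumption: real harnesses may use other formats).
LABEL_STYLES = {
    "letter_dot": lambda i: f"{ascii_uppercase[i]}.",
    "letter_paren": lambda i: f"({ascii_uppercase[i]})",
    "number_paren": lambda i: f"{i + 1})",
}


def render_prompt(question, options, style):
    """Render one multiple-choice item using the given label style."""
    label = LABEL_STYLES[style]
    lines = [question]
    lines += [f"{label(i)} {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def render_all(question, options):
    """Return a dict mapping each style name to its rendered prompt."""
    return {s: render_prompt(question, options, s) for s in LABEL_STYLES}


if __name__ == "__main__":
    # Illustrative item, not from any benchmark.
    variants = render_all("What is 2 + 2?", ["3", "4", "5", "6"])
    for style, prompt in variants.items():
        print(f"--- {style} ---\n{prompt}\n")
```

Scoring the same model on each variant and comparing accuracy would be one way to check how sensitive a result is to formatting.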