Bias score with dataset BBQnyu-mll • Updated 2025 Jul 21 12:8
BBQ
nyu-mll • Updated 2025 Jul 21 12:8


While MMLU is a simple multiple-choice evaluation, even minor changes in option formatting can significantly affect performance scores. On the other hand, evaluations like BBQ Benchmark, Big-Bench, and HELM are noted for their complexity due to challenges in implementation, interpretation, and technical intricacies that make it difficult to accurately measure model performance.
Challenges in evaluating AI systems
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
https://www.anthropic.com/research/evaluating-ai-systems

hf
heegyu/bbq · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
https://huggingface.co/datasets/heegyu/bbq/viewer/Age/test
korean bbq
naver-ai/kobbq · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
https://huggingface.co/datasets/naver-ai/kobbq

Seonglae Cho