Texonom
Texonom
/
Engineering
Engineering
/Data Engineering/Artificial Intelligence/AI Problem/AI Hacking/AI Bias/
BBQ Benchmark
Search

BBQ Benchmark

Creator
Creator
Seonglae Cho
Created
Created
2024 Nov 27 23:9
Editor
Editor
Seonglae Cho
Edited
Edited
2025 Apr 10 23:10
Refs
Refs

Bias score with dataset
BBQ
nyu-mll • Updated 2025 Apr 29 20:50

notion image
https://www.anthropic.com/research/evaluating-feature-steering
 
 
 
 
While
MMLU
is a simple multiple-choice evaluation, even minor changes in option formatting can significantly affect performance scores. On the other hand, evaluations like
BBQ Benchmark
,
Big-Bench
, and
HELM
are noted for their complexity due to challenges in implementation, interpretation, and technical intricacies that make it difficult to accurately measure model performance.
Challenges in evaluating AI systems
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
Challenges in evaluating AI systems
https://www.anthropic.com/research/evaluating-ai-systems
Challenges in evaluating AI systems
arxiv.org
https://arxiv.org/pdf/2110.08193
hf
heegyu/bbq · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
heegyu/bbq · Datasets at Hugging Face
https://huggingface.co/datasets/heegyu/bbq/viewer/Age/test
heegyu/bbq · Datasets at Hugging Face
korean bbq
naver-ai/kobbq · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
naver-ai/kobbq · Datasets at Hugging Face
https://huggingface.co/datasets/naver-ai/kobbq
naver-ai/kobbq · Datasets at Hugging Face
 
 

Recommendations

Texonom
Texonom
/
Engineering
Engineering
/Data Engineering/Artificial Intelligence/AI Problem/AI Hacking/AI Bias/
BBQ Benchmark
Copyright Seonglae Cho