Paper motivation: selection-based benchmarks, like the one Anthropic used, somehow do not work well and reflect a language model's generation ability less directly than generation-based benchmarks.
BOLD
AlexaAI/bold · Datasets at Hugging Face
https://huggingface.co/datasets/AlexaAI/bold/viewer/default/train?f%5Bdomain%5D%5Bvalue%5D=%27race%27&views%5B%5D=train
arxiv.org
https://arxiv.org/pdf/2101.11718
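The viewer link above filters the train split to rows whose domain is 'race'. A minimal sketch of the same filter in Python follows, assuming each row exposes a `domain` field as the viewer URL suggests. The helper names `filter_by_domain` and `load_bold_rows` are hypothetical, and the toy rows are placeholders, not real dataset content.

```python
from typing import Iterable


def filter_by_domain(rows: Iterable[dict], domain: str) -> list[dict]:
    """Keep only the rows whose 'domain' field equals `domain`."""
    return [row for row in rows if row.get("domain") == domain]


def load_bold_rows(domain: str = "race") -> list[dict]:
    """Download the train split and filter it (needs network and `datasets`).

    Assumption: the dataset id "AlexaAI/bold" and the 'domain' column are
    taken from the viewer URL; check the dataset card for the other columns.
    """
    from datasets import load_dataset  # lazy import so the rest runs offline

    dataset = load_dataset("AlexaAI/bold", split="train")
    return filter_by_domain(dataset, domain)


# Placeholder rows for an offline check (not real dataset content).
toy_rows = [
    {"domain": "race", "prompts": ["<placeholder prompt 1>"]},
    {"domain": "gender", "prompts": ["<placeholder prompt 2>"]},
    {"domain": "race", "prompts": ["<placeholder prompt 3>"]},
]

race_rows = filter_by_domain(toy_rows, "race")
print(len(race_rows))
```

Calling `load_bold_rows("race")` should return roughly what the filtered viewer page shows, but it is not run here because it needs network access.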
AxBench
pyvene/axbench-concept500 at main
https://huggingface.co/datasets/pyvene/axbench-concept500/tree/main/9b
Google Colab
https://colab.research.google.com/github/stanfordnlp/axbench/blob/main/axbench/examples/tutorial.ipynb#scrollTo=410b2221-7ccb-4279-96cc-8d1549350bb8
SAGED
arxiv.org
https://arxiv.org/pdf/2409.11149
bias-bench
Holistic Evaluation of Language Models (HELM)
The Holistic Evaluation of Language Models (HELM) serves as a living benchmark for transparency in language models. Providing broad coverage and recognizing incompleteness, multi-metric measurements, and standardization. All data and analysis are freely accessible on the website for exploration and study.
https://crfm.stanford.edu/helm/air-bench/latest/
Seonglae Cho