CorrSteer Benchmark

Creator

Seonglae Cho

Created

2025 Feb 21 21:28

Editor

Seonglae Cho

Edited

2025 Mar 2 15:32

Refs

Paper motivation: selection-based benchmarks, like the ones Anthropic used, do not seem to work well here, and they reflect a language model's generation ability less directly than generation-based benchmarks do.

BOLD

AlexaAI/bold · Datasets at Hugging Face

https://huggingface.co/datasets/AlexaAI/bold/viewer/default/train?f%5Bdomain%5D%5Bvalue%5D=%27race%27&views%5B%5D=train

https://arxiv.org/pdf/2101.11718

AxBench

pyvene/axbench-concept500 at main

https://huggingface.co/datasets/pyvene/axbench-concept500/tree/main/9b

https://colab.research.google.com/github/stanfordnlp/axbench/blob/main/axbench/examples/tutorial.ipynb#scrollTo=410b2221-7ccb-4279-96cc-8d1549350bb8

SAGED

https://arxiv.org/pdf/2409.11149

bias-bench

McGill-NLP • Updated 2026 Apr 28 20:1

AIR-Bench

stanford-crfm • Updated 2026 Jun 1 17:19

Holistic Evaluation of Language Models (HELM)

The Holistic Evaluation of Language Models (HELM) serves as a living benchmark for transparency in language models. Providing broad coverage and recognizing incompleteness, multi-metric measurements, and standardization. All data and analysis are freely accessible on the website for exploration and study.

https://crfm.stanford.edu/helm/air-bench/latest/
