AI sandbagging

Capability evaluation

A strategically deceptive behavior in which one deliberately downplays or conceals one's abilities, performance, or intentions to mislead others and gain an advantage.

sandbagging

Can SAE steering reveal sandbagging? — LessWrong

Summary
* We conducted a small investigation into using SAE features to recover sandbagged capabilities.
* We used the Goodfire API to pick out 15…

https://www.lesswrong.com/posts/dBckLjYfTShBGZ8ma/can-sae-steering-reveal-sandbagging

WMDP Bench

MMLU measures selective sandbagging, but its domain distribution doesn't match.
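
One way to read this note, as a rough sketch only: selective sandbagging can be quantified by comparing the accuracy drop on a target benchmark (for example WMDP) with the drop on a control benchmark (for example MMLU). The names `Accuracy`, `accuracy_drop`, and `selectivity_gap` are hypothetical and do not come from either linked source, and the numbers in the usage lines are made-up placeholders, not results.

```python
from dataclasses import dataclass


@dataclass
class Accuracy:
    """Accuracy of one model on one benchmark under two conditions."""
    baseline: float    # accuracy when the model is evaluated normally
    sandbagged: float  # accuracy under the suspected sandbagging condition


def accuracy_drop(acc: Accuracy) -> float:
    """Absolute accuracy lost between the baseline and sandbagged runs."""
    return acc.baseline - acc.sandbagged


def selectivity_gap(target: Accuracy, control: Accuracy) -> float:
    """Drop on the target benchmark minus drop on the control benchmark.

    A large positive gap suggests the underperformance is selective
    (concentrated on the target) rather than a general degradation.
    """
    return accuracy_drop(target) - accuracy_drop(control)


# Placeholder numbers for illustration only (assumed, not measured).
target = Accuracy(baseline=0.60, sandbagged=0.30)   # e.g. a WMDP-style benchmark
control = Accuracy(baseline=0.70, sandbagged=0.68)  # e.g. an MMLU-style benchmark
gap = selectivity_gap(target, control)
```

Because the two benchmarks cover different domains, a gap computed this way may partly reflect differences in subject matter or difficulty rather than sandbagging alone, which is the caveat the note raises.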

aclanthology.org

https://aclanthology.org/2025.ijcnlp-short.33.pdf

Recommendations