Capability evaluation
A strategically deceptive behavior in which one deliberately downplays or conceals one's abilities, performance, or intentions in order to mislead others and gain an advantageous position.
sandbagging
Can SAE steering reveal sandbagging? — LessWrong
Summary
* We conducted a small investigation into using SAE features to recover sandbagged capabilities.
* We used the Goodfire API to pick out 15…
https://www.lesswrong.com/posts/dBckLjYfTShBGZ8ma/can-sae-steering-reveal-sandbagging
WMDP Bench and MMLU are used to measure selective sandbagging, but their domain distributions do not match.
aclanthology.org
https://aclanthology.org/2025.ijcnlp-short.33.pdf
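As a rough illustration of what "measuring selective sandbagging" can mean, the sketch below compares the accuracy drop on a target benchmark (for example WMDP) with the drop on a control benchmark (for example MMLU) between a baseline run and a suspected sandbagging run. The function names and the `selectivity` score are hypothetical and are not taken from the linked paper. The inputs are assumed to be per-question correctness flags. Because the two benchmarks cover different domains, the control drop is only an imperfect reference.

```python
from typing import Dict, Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    if len(correct) == 0:
        raise ValueError("need at least one graded answer")
    return sum(correct) / len(correct)


def selective_sandbagging_gap(
    target_baseline: Sequence[bool],
    target_suspect: Sequence[bool],
    control_baseline: Sequence[bool],
    control_suspect: Sequence[bool],
) -> Dict[str, float]:
    """Hypothetical metric: how much more accuracy is lost on the target
    benchmark than on the control benchmark.

    A large positive `selectivity` suggests the drop is concentrated on the
    target domain rather than being a general loss of capability.
    """
    target_drop = accuracy(target_baseline) - accuracy(target_suspect)
    control_drop = accuracy(control_baseline) - accuracy(control_suspect)
    return {
        "target_drop": target_drop,
        "control_drop": control_drop,
        "selectivity": target_drop - control_drop,
    }


if __name__ == "__main__":
    # Made-up correctness flags, for illustration only (not real results).
    scores = selective_sandbagging_gap(
        target_baseline=[True] * 8 + [False] * 2,
        target_suspect=[True] * 3 + [False] * 7,
        control_baseline=[True] * 7 + [False] * 3,
        control_suspect=[True] * 6 + [False] * 4,
    )
    for name, value in scores.items():
        print(f"{name}: {value:.2f}")
```

A score like this only flags a suspicious pattern. It does not by itself separate deliberate underperformance from ordinary differences in difficulty between the two benchmarks.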

Seonglae Cho