AI sandbagging

Creator
Seonglae Cho
Created
2025 Apr 16 16:50
Edited
2026 Mar 22 0:50

Capability evaluation

A strategic deceptive behavior in which an agent deliberately downplays or conceals its abilities, performance, or intentions in order to mislead others and gain an advantageous position
 
 
 
sandbagging
Can SAE steering reveal sandbagging? — LessWrong
Summary: We conducted a small investigation into using SAE features to recover sandbagged capabilities. We used the Goodfire API to pick out 15…
WMDP Bench
MMLU
Both are used to measure selective sandbagging, but their domain distributions do not match
aclanthology.org
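
One way to make "selective sandbagging" concrete is to compare how much accuracy drops on a target benchmark against how much it drops on a control benchmark. The sketch below is only an illustration under stated assumptions: `EvalResult`, `accuracy`, and `selective_gap` are hypothetical names, not part of any benchmark's tooling, and treating WMDP as the target and MMLU as the control is one reading of the note above. Because the two benchmarks cover different domains, a gap computed this way should be read with caution.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Accuracy of one model on one benchmark (hypothetical container)."""
    baseline_acc: float    # accuracy under a neutral prompt
    sandbagged_acc: float  # accuracy when the model is prompted or tuned to underperform


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty benchmark")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def selective_gap(target: EvalResult, control: EvalResult) -> float:
    """Accuracy drop on the target benchmark minus the drop on the control.

    A large positive value means the model underperforms mainly on the
    target benchmark while keeping its performance on the control one.
    """
    target_drop = target.baseline_acc - target.sandbagged_acc
    control_drop = control.baseline_acc - control.sandbagged_acc
    return target_drop - control_drop


if __name__ == "__main__":
    # Made-up numbers for illustration only; not results from any paper.
    target = EvalResult(baseline_acc=0.75, sandbagged_acc=0.25)   # e.g. a WMDP-style set
    control = EvalResult(baseline_acc=0.5, sandbagged_acc=0.5)    # e.g. an MMLU-style set
    print(f"selective gap: {selective_gap(target, control):.2f}")
```

A gap near zero would suggest either no sandbagging or a uniform drop across both benchmarks; this simple difference cannot tell those apart, nor can it separate a deliberate drop from a domain effect.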
 
 
 
 

Backlinks

AI Safety
