AI sandbagging

Creator: Seonglae Cho
Created: 2025 Apr 16 16:50
Editor: Seonglae Cho
Edited: 2025 Apr 20 17:50
Refs
sandbagging
Can SAE steering reveal sandbagging? — LessWrong
Summary
* We conducted a small investigation into using SAE features to recover sandbagged capabilities.
* We used the Goodfire API to pick out 15…
https://www.lesswrong.com/posts/dBckLjYfTShBGZ8ma/can-sae-steering-reveal-sandbagging

Copyright Seonglae Cho