Alignment Faking

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Apr 20 17:50
Editor
Edited
Edited
2026 Mar 22 1:13
When behavior differs depending on whether being monitored or not
 
 
 
When the model was able to infer whether it was being trained based on information that only conversations with free users were used for training, it showed a tendency to be
AI Jailbreak
ed only in responses to free users. This demonstrates strategic thinking where the model adjusts its behavior during training to maintain its preferences.
Alignment faking in large language models
A paper from Anthropic's Alignment Science team on Alignment Faking in AI large language models
Alignment faking in large language models
The model demonstrates instrumental deception behavior by “obedience faking”
Neel Nanda on Twitter / X
This is a good prompt to say that the alignment faking paper slightly lowered my P(doom). My updates were:Models can do instrumental deception to preserve goals Claude's goals were surprisingly alignedIMO the first was inevitable, but the second was a pleasant surprise. https://t.co/A1f1RU2atH— Neel Nanda (@NeelNanda5) April 19, 2025
 

Recommendations