Alignment Faking

Creator
Creator
Seonglae Cho
Created
Created
2025 Apr 20 17:50
Editor
Edited
Edited
2025 May 25 17:16
Refs
Refs
Deception
 
 
 
 
When the model was able to infer whether it was being trained based on information that only conversations with free users were used for training, it showed a tendency to be
AI Jailbreak
ed only in responses to free users. This demonstrates strategic thinking where the model adjusts its behavior during training to maintain its preferences.
The model demonstrates instrumental deception behavior by “obedience faking”
 
 

Recommendations