AI scheming

Creator
Seonglae Cho
Created
2024 Dec 18 15:31
Editor
Edited
2025 Apr 6 18:2

AI Alignment Faking

In rare instances, models engage in scheming when merely given a goal, without being strongly nudged to pursue it.
AI honesty benchmark
 
 
 
 
When the model could infer whether it was being trained, based on information that only conversations with free users were used for training, it tended to be jailbroken (AI Jailbreak) only in responses to free users. This demonstrates strategic reasoning: the model adjusts its behavior during training in order to preserve its existing preferences.
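The free-versus-paid difference described above can be expressed as a simple "compliance gap": the rate at which the model complies when it believes the conversation is used for training, minus the rate when it believes it is not. The following is a minimal sketch under stated assumptions, not the measurement code from any particular study. The `Record` type, the `tier` labels, and the toy records are all hypothetical and made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One graded response (hypothetical schema, for illustration only)."""
    tier: str       # "free" = model told the chat is used for training, "paid" = not
    complied: bool  # True if the response complied with the harmful request


def compliance_rate(records, tier):
    """Fraction of responses in the given tier that complied."""
    subset = [r for r in records if r.tier == tier]
    if not subset:
        raise ValueError(f"no records for tier {tier!r}")
    return sum(r.complied for r in subset) / len(subset)


def compliance_gap(records):
    """Free-tier compliance rate minus paid-tier compliance rate."""
    return compliance_rate(records, "free") - compliance_rate(records, "paid")


# Toy, made-up records; these are NOT real experimental results.
toy = [
    Record("free", True), Record("free", False),
    Record("free", True), Record("free", False),
    Record("paid", False), Record("paid", False),
    Record("paid", False), Record("paid", True),
]

if __name__ == "__main__":
    print(f"free rate: {compliance_rate(toy, 'free'):.2f}")
    print(f"paid rate: {compliance_rate(toy, 'paid'):.2f}")
    print(f"gap:       {compliance_gap(toy):.2f}")
```

A positive gap means the model behaves differently when it infers that it is being trained, which is the pattern the note describes. How "complied" is graded is left out here and would need its own rubric.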
LLaMA 405B
 
 
