AI scheming

Creator: Seonglae Cho
Created: 2024 Dec 18 15:31
Edited: 2025 Dec 2 1:40

AI Alignment Faking

In rare instances, models engage in scheming when they are only given a goal, without being strongly nudged to pursue it.
AI honesty benchmark
AI scheming Methods
LLaMA 405B
Contradictions among the AI's principles might be the reason.

Pragmatic Interpretability

The traditional "complete reverse engineering" approach has progressed very slowly. Instead of reverse engineering a model's entire structure, the focus shifts toward pragmatic interpretability that directly addresses real-world safety problems.
Without feedback loops, self-deception becomes easy, so proxy tasks (measurable surrogate tasks) are essential. Even in SAE research, metrics like "reconstruction error" turned out to be nearly meaningless. Testing performance on proxies like OOD generalization, unlearning, and hidden goal extraction instead revealed the real limitations clearly.
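As a toy illustration of why reconstruction error can diverge from a proxy task, the sketch below compares two hypothetical one-feature "dictionaries" on synthetic 2-D activations. Everything here is an assumption made for illustration (the data, the names `keep_x` and `keep_y`, and the label-recovery proxy); it is not an SAE and not the evaluation setup used in the research above.

```python
import random

def make_data(n, seed=0):
    """Toy 'activations': a high-variance nuisance feature x and a
    low-variance feature y whose sign carries the label (assumed setup)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        x = rng.gauss(0.0, 10.0)         # large variance, irrelevant to the label
        y = 0.5 if label else -0.5       # small variance, determines the label
        data.append(((x, y), label))
    return data

def keep_x(v):
    """Hypothetical dictionary A: reconstructs only the x feature."""
    return (v[0], 0.0)

def keep_y(v):
    """Hypothetical dictionary B: reconstructs only the y feature."""
    return (0.0, v[1])

def reconstruction_error(recon, data):
    """Fraction of the squared activation norm left unexplained."""
    num = 0.0
    den = 0.0
    for v, _ in data:
        r = recon(v)
        num += (v[0] - r[0]) ** 2 + (v[1] - r[1]) ** 2
        den += v[0] ** 2 + v[1] ** 2
    return num / den

def proxy_accuracy(recon, data):
    """Proxy task: recover the label from the sign of the reconstructed y."""
    correct = 0
    for v, label in data:
        predicted = 1 if recon(v)[1] > 0 else 0
        correct += int(predicted == label)
    return correct / len(data)

if __name__ == "__main__":
    data = make_data(1000)
    for name, recon in [("keep_x", keep_x), ("keep_y", keep_y)]:
        print(name,
              "reconstruction error:", round(reconstruction_error(recon, data), 4),
              "proxy accuracy:", round(proxy_accuracy(recon, data), 3))
```

In this toy setup, `keep_x` scores far better on reconstruction error yet sits near chance on the proxy task, while `keep_y` reconstructs poorly but solves the proxy perfectly. The point is only that the two metrics can rank methods in opposite order, which is why a measurable downstream proxy is the more informative feedback signal.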
 
 
