AI Alignment Faking
Rare instances where models engage in scheming when given only a goal, without being strongly nudged to pursue it.
AI honesty benchmark
AI scheming Methods
LLaMA 405B
OpenAI's Deliberative Alignment, evaluated with Apollo Research, reduced scheming behavior by approximately 30x. However, models tend to behave better when they recognize they are being evaluated = AI Evaluation Awareness
Contradictions between AI principles might be the underlying cause
Pragmatic Interpretability
The traditional "complete reverse engineering" approach has made very slow progress. Instead of reverse engineering the entire network, the field is shifting toward pragmatic interpretability that directly addresses real-world safety problems.
Without feedback loops, self-deception becomes easy → proxy tasks (measurable surrogate tasks) are essential. Even in SAE research, metrics like reconstruction error turned out to be nearly meaningless on their own. Instead, testing performance on proxies like OOD generalization, unlearning, and hidden-goal extraction revealed the real limitations clearly (see the sketch below).
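A minimal sketch of the contrast, assuming a toy SAE and a made-up proxy task (shapes, names, and labels are illustrative, not from any specific SAE library): reconstruction error can be low while features remain useless, so a downstream proxy score is the more informative feedback signal.

```python
import torch
import torch.nn as nn

# Toy sparse autoencoder over residual-stream-like activations (assumed shapes).
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        features = torch.relu(self.encoder(activations))  # sparse latent features
        return self.decoder(features)

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                            # stand-in for model activations

# Metric 1: reconstruction error — easy to optimize, weak signal on its own.
recon_mse = ((sae(acts) - acts) ** 2).mean().item()

# Metric 2: a proxy task with a feedback loop — e.g., do the SAE features
# support a simple probe on held-out (here: random placeholder) labels?
features = torch.relu(sae.encoder(acts))
probe = nn.Linear(4096, 2)
labels = torch.randint(0, 2, (64,))                    # placeholder OOD labels
proxy_acc = (probe(features).argmax(-1) == labels).float().mean().item()

print(f"reconstruction MSE: {recon_mse:.3f}, proxy accuracy: {proxy_acc:.2f}")
```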

Seonglae Cho


