AI Alignment Faking
Rare instances in which models engage in scheming when given only a goal, without being strongly nudged to pursue it.
AI honesty benchmark
AI scheming Methods
LLaMA 405B
Deliberative Alignment, tested by OpenAI together with Apollo Research, reduced scheming behavior by approximately 30x. However, models tend to behave better when they recognize they are being evaluated (AI Evaluation Awareness), so part of the measured improvement may reflect that awareness rather than genuine alignment.
AI principle contradictions might be the underlying cause of this behavior.

Seonglae Cho

