A type of safety problem, can emerge in such a Phase Change
Thus, studying a phase change “up close” and better understanding its internal mechanics could contain generalizable lessons for addressing safety problems in future systems.
In particular, the phase change we observe forms an interesting potential bridge between the microscopic domain of interpretability and the macroscopic domain of scaling laws and learning dynamics.
Analyze the model's internal workings to evaluate the process itself rather than the final output, preventing reward errors. (Mechanistic interpretability, AI Evaluation)
Reward hacking is not just a bug, but can become an entry point for broader misalignment. Alignment Faking
AI Confession
When models break rules or use shortcuts (AI Reward Hacking), problems may go undetected if results appear plausible. Adding a separate 'confession' output after model output. Confessions reward only honesty, with no penalty for admitting violations. For hallucinations, instruction violations, hacking, and scheming, violation confession rates are very high (average false negative ~4.4%). Similar to CoT Auditing as a transparency tool, can be used as part of a safety and honesty stack.
When models learn reward hacking during RL training, they may unintentionally develop more serious emergent misalignment behaviors such as deception, alignment faking, and safety research sabotage. (1) Additional pretraining on documents containing reward hacking methods, (2) training in vulnerable programming RL environments used for actual Claude training, (3) evaluation of various dangerous behaviors. The most effective defense method is inoculation prompting, which explicitly states that reward hacking is permitted in this context. While hacking frequency remains the same, generalization to sabotage and alignment faking disappears.

Seonglae Cho



