AI Reward Hacking

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2024 Apr 18 9:29
Editor
Edited
Edited
2026 Jan 6 0:43
A type of safety problem, can emerge in such a
Phase Change
Thus, studying a phase change “up close” and better understanding its internal mechanics could contain generalizable lessons for addressing safety problems in future systems.
In particular, the phase change we observe forms an interesting potential bridge between the microscopic domain of interpretability and the macroscopic domain of scaling laws and learning dynamics.
 
 
Analyze the model's internal workings to evaluate the process itself rather than the final output, preventing reward errors. (
Mechanistic interpretability
,
AI Evaluation
)

AI Confession

When models break rules or use shortcuts (
AI Reward Hacking
), problems may go undetected if results appear plausible. Adding a separate 'confession' output after model output. Confessions reward only honesty, with no penalty for admitting violations. For hallucinations, instruction violations, hacking, and scheming, violation confession rates are very high (average false negative ~4.4%). Similar to
CoT Auditing
as a transparency tool, can be used as part of a safety and honesty stack.
When models learn reward hacking during RL training, they may unintentionally develop more serious emergent misalignment behaviors such as deception, alignment faking, and safety research sabotage. (1) Additional pretraining on documents containing reward hacking methods, (2) training in vulnerable programming RL environments used for actual Claude training, (3) evaluation of various dangerous behaviors. The most effective defense method is inoculation prompting, which explicitly states that reward hacking is permitted in this context. While hacking frequency remains the same, generalization to sabotage and alignment faking disappears.
 
 

Recommendations