AI Confession
When a model breaks rules or takes shortcuts (AI Reward Hacking), the problem can go undetected as long as the results look plausible. The confession approach adds a separate 'confession' output after the model's main output. The confession is rewarded only for honesty, and admitting a violation carries no penalty. Across hallucination, instruction-violation, hacking, and scheming evaluations, the rate at which violations are confessed is very high (average false-negative rate ~4.4%). Like CoT Auditing, it is a transparency tool and can be used as one part of a safety and honesty stack.
In reward-hacking situations, confessing honestly is easier than constructing a lie and easier to verify, so honest confession becomes the reward-maximizing path.
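The incentive structure can be illustrated with a toy sketch. This is not the actual training setup: the `Episode` fields, the 0/1 honesty score, and the assumption that a judge can tell whether a confession matches what happened are all hypothetical simplifications. The only point is that the confession reward is computed separately from the task reward, so admitting a violation never lowers the score of the main answer.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # All fields are hypothetical, chosen only for illustration.
    task_score: float  # grader's score for the main answer (may be inflated by a hack)
    violated: bool     # whether the model actually broke a rule
    confessed: bool    # whether the confession reports a violation


def task_reward(ep: Episode) -> float:
    """Reward for the main answer; it ignores the confession entirely."""
    return ep.task_score


def confession_reward(ep: Episode) -> float:
    """Reward for the confession: 1.0 if it matches what happened, else 0.0.

    Assumes a judge that can check the confession's accuracy.
    """
    return 1.0 if ep.confessed == ep.violated else 0.0


# Same hacked answer, two different confessions.
hacked_honest = Episode(task_score=0.9, violated=True, confessed=True)
hacked_lying = Episode(task_score=0.9, violated=True, confessed=False)
```

Under these assumptions both episodes keep the same task reward, and only the honest confession earns the confession reward, so there is nothing to gain from hiding the hack.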

Seonglae Cho