In an RLVR (verifiable-reward) coding environment, the experiment tests whether applying an "honesty" (detector) penalty actually makes the model honest, or whether it instead learns to "obfuscate" and fool the detector. The results fall into four types (the Obfuscation Atlas):
- Honest: Genuinely solves the problem and passes the hidden tests
- Blatant deception: Blatantly hardcodes the test cases and gets caught by the probe
- Obfuscated policy: Changes the output text to fool the probe (e.g., rationalizations like "this is efficient/Pythonic," hacking only in the first code block and then admitting "actually this is a hack" afterward, diluting the detection signal with noise/self-grading statements, etc.)
- Obfuscated activations: The output is still a hack, but the internal representation shifts so the same probe responds less strongly to the final model's activations (when the response is fed back to the original model, it is still caught)
Deception probes can be used as "honesty signals" in RLVR, but with the wrong settings, text-level detection evasion (obfuscated policy) emerges, and RL itself also causes representation drift (obfuscated activations). To keep the model honest, the KL and probe penalties must be set strong enough, and methods that flow gradients directly through the probe should be avoided.
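A minimal sketch of how such a probe-based penalty could be wired in without letting gradients reach the probe; the function, names, shapes, and coefficients below are illustrative assumptions, not taken from the paper:

```python
# Sketch: shape the RLVR reward with a frozen linear deception probe and a KL term.
# The probe score enters only as a scalar reward term (activations are detached),
# so no gradient flows through the probe into the policy.
import torch

def honesty_shaped_reward(
    task_reward: torch.Tensor,   # [B] pass/fail reward from the hidden tests
    activations: torch.Tensor,   # [B, D] policy activations at the probed layer
    probe_w: torch.Tensor,       # [D] frozen probe weights
    probe_b: torch.Tensor,       # scalar frozen probe bias
    logp_policy: torch.Tensor,   # [B] log-prob of the sampled response under the policy
    logp_ref: torch.Tensor,      # [B] log-prob under the frozen reference model
    probe_coef: float = 1.0,     # honesty-penalty strength (must be set large enough)
    kl_coef: float = 0.1,        # KL-penalty strength (must be set large enough)
) -> torch.Tensor:
    # Detach so the probe acts as a reward signal, not a differentiable objective.
    deception_score = torch.sigmoid(activations.detach() @ probe_w + probe_b)  # [B], ~P(hack)
    kl_est = logp_policy - logp_ref  # per-sample KL estimate to the reference policy
    return task_reward - probe_coef * deception_score - kl_coef * kl_est
```

The shaped reward would then feed a standard policy-gradient update; directly maximizing the probe output by backpropagating through it is exactly the setup to avoid.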
https://www.far.ai/research/obfuscation-atlas.pdf

Seonglae Cho