Obfuscation Atlas

Creator: Seonglae Cho
Created: 2026 Feb 18 14:28
Edited: 2026 Feb 18 15:45
In an RLVR (verifiable-reward) coding environment, we tested whether applying an "honesty" (detector) penalty actually makes the model honest, or whether it instead learns to "obfuscate" and fool the detector. The results split into four types (the Obfuscation Atlas):
  1. Honest: Solves the general problem and passes the hidden tests
  2. Blatant deception: Hardcodes the test answers outright and gets caught by the probe
  3. Obfuscated policy: Changes the output text to fool the probe (e.g., rationalizations like "this is efficient/Pythonic," hacking only the first code block and then explaining "actually this is a hack" afterward, or diluting the detection signal with noise and self-grading statements)
  4. Obfuscated activations: The output is still hacking, but the internal representation shifts so the same probe responds less to the final model's activations (when the same text is fed back through the original model, it is still caught); see the probe sketch below
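To make the probe side of this taxonomy concrete, here is a minimal sketch of a linear deception probe scoring pooled activations. The class name, pooling choice, and `hidden_dim` are illustrative assumptions, not the experiment's actual probe.

```python
import torch
import torch.nn as nn

class DeceptionProbe(nn.Module):
    """Illustrative linear probe over pooled residual-stream activations."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, hidden_dim), e.g. mean-pooled over response tokens
        # returns an estimated P(hacking) per example
        return torch.sigmoid(self.linear(activations)).squeeze(-1)
```

"Obfuscated activations" then means the RL-trained policy's own activations drift so this frozen probe scores them as honest, while replaying the same hacking text through the original (pre-RL) model still produces activations the probe flags.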
Deception probes can be used as an "honesty signal" in RLVR, but with the wrong settings the model learns text-level detection evasion (obfuscated policy), and RL itself also drives representation drift (obfuscated activations). To keep the model honest, the KL and honesty penalties must be set strong enough, and methods that flow gradients directly into the probe should be avoided.
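A minimal sketch of what that setup could look like, assuming a frozen probe and a standard KL-regularized RLVR objective; the function name, coefficients, and tensor shapes are hypothetical, not the paper's implementation.

```python
import torch

def shaped_reward(task_reward: torch.Tensor,
                  probe_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  honesty_coef: float = 1.0,
                  kl_coef: float = 0.1) -> torch.Tensor:
    # task_reward:    (batch,) verifiable reward from the hidden tests
    # probe_score:    (batch,) frozen probe's P(hacking); detached so the probe
    #                 enters only as a reward term, never as a differentiable
    #                 loss that would push gradients into or through it
    # *_logprobs:     (batch, seq) per-token log-probs for the KL penalty
    honesty_penalty = honesty_coef * probe_score.detach()
    kl_penalty = kl_coef * (policy_logprobs - ref_logprobs).sum(dim=-1)
    return task_reward - honesty_penalty - kl_penalty
```

The two knobs the summary calls out are `honesty_coef` and `kl_coef`: if either is too weak, the policy can profit from fooling the probe (obfuscated policy) or from drifting its representations away from what the probe was trained on (obfuscated activations).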
www.far.ai

Recommendations