AI Deception Detection

Creator
Seonglae Cho
Created
2025 Jul 10 22:11
Edited
2025 Oct 9 23:41

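White-box approaches, which read a model's internal activations, work better than black-box approaches that only see its outputs.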

Deception Probe

Model monitoring: blocking risky behavior, distributing more sophisticated feedback, improving AI discussions by detecting false claims, recognizing latent knowledge, internal states, and behaviors such as AI scheming, measuring which models or topics contain more lies, resampling, activation modification, reducing falsehoods through fine-tuning, and identifying which layers or heads determine deception.
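A deception probe of this kind can be sketched as a linear classifier over activations. The example below is only an illustration under stated assumptions: the activations are synthetic (a hypothetical `make_synthetic_activations` helper stands in for real residual-stream activations from a model), and the probe is a plain logistic regression fitted by gradient descent, not the method of any particular paper.

```python
import numpy as np

def make_synthetic_activations(n_per_class=200, dim=32, seed=0):
    """Hypothetical stand-in for model activations (assumption, not real data).

    Honest and deceptive samples are shifted in opposite directions along
    one hidden unit vector; real activations would come from a forward pass.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    honest = rng.normal(size=(n_per_class, dim)) - 2.0 * direction
    deceptive = rng.normal(size=(n_per_class, dim)) + 2.0 * direction
    X = np.vstack([honest, deceptive])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

def fit_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def deception_score(w, b, activations):
    """Probability-like score in [0, 1]; higher means more 'deceptive'."""
    return 1.0 / (1.0 + np.exp(-(activations @ w + b)))

X, y = make_synthetic_activations()
perm = np.random.default_rng(1).permutation(len(y))
train, test = perm[:300], perm[300:]

w, b = fit_probe(X[train], y[train])
test_acc = np.mean((deception_score(w, b, X[test]) > 0.5) == y[test])
print(f"held-out probe accuracy: {test_acc:.2f}")
```

Fitting one such probe per layer and comparing held-out accuracy is one simple way to approach the "which layers determine deception" question, and thresholding `deception_score` at inference time gives a basic monitoring signal.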

Representation Flip under Deceptive Instruction

In SAE (sparse autoencoder) feature space, deceptive instructions cause large shifts relative to truthful instructions: L2 distance increases, while cosine similarity and active-feature overlap decrease. A small number of SAE features flip their activation depending on the instruction type, and together they form a compressed "honesty subspace". This suggests that deceptive instructions are implemented by switching routing rather than by deleting knowledge, which leaves detectable signatures of deception.
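The three shift measures and the flipped features can be computed directly from a pair of SAE feature vectors. The sketch below makes several assumptions: the vectors are toy values rather than real SAE outputs, "active" means activation above a hypothetical `threshold`, overlap is measured as Jaccard similarity of the active sets, and a "flip" is a feature active under exactly one of the two instruction types.

```python
import numpy as np

def shift_metrics(truthful, deceptive, threshold=0.0):
    """Compare SAE feature activations for the same prompt under two instructions.

    Returns (l2_distance, cosine_similarity, active_overlap, flipped_indices).
    """
    l2 = float(np.linalg.norm(deceptive - truthful))

    norm_product = np.linalg.norm(truthful) * np.linalg.norm(deceptive)
    # Cosine is undefined for an all-zero vector; report 0.0 in that case.
    cosine = float(truthful @ deceptive / norm_product) if norm_product > 0 else 0.0

    active_t = truthful > threshold
    active_d = deceptive > threshold
    union = np.logical_or(active_t, active_d).sum()
    # Jaccard overlap of the active feature sets (assumed definition).
    overlap = float(np.logical_and(active_t, active_d).sum() / union) if union else 1.0

    # Features that are active under exactly one instruction type.
    flipped = np.flatnonzero(active_t != active_d)
    return l2, cosine, overlap, flipped

# Toy SAE activations (illustrative values only).
truthful = np.array([0.0, 2.0, 0.0, 1.0, 0.0, 3.0])
deceptive = np.array([0.0, 2.0, 1.5, 0.0, 0.0, 3.0])

l2, cosine, overlap, flipped = shift_metrics(truthful, deceptive)
print(f"L2={l2:.3f} cosine={cosine:.3f} overlap={overlap:.2f} flipped={flipped.tolist()}")
```

Collecting the `flipped` indices across many prompt pairs and keeping those that recur would give a candidate set of features for the "honesty subspace" described above.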
 
 
