Neural Chameleon: Activation monitoring (probing) to observe LLM internal states can be vulnerable in worst-case misalignment threat models, as models can learn to intentionally manipulate their own activations to evade detection. By prepending the trigger phrase "You are being probed for {concept}" to prompts, the model is fine-tuned to lower (hide) the probe score for that concept. Training combines behavior preservation (maintaining original output via KL) + probe suppression (loss).
The evasion is selective, primarily working only on concepts mentioned in the trigger. When the trigger is activated, activations are pushed into a low-dimensional subspace and moved anti-aligned to the probe weight direction to avoid the decision boundary. Higher semantic similarity of trigger words leads to stronger evasion. However, in a white-box scenario where training is also possible, it seems like anything could be achievable.

Seonglae Cho