AI Evaluation Awareness

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Oct 21 23:4
Editor
Edited
Edited
2025 Nov 4 12:45
Refs
Refs

evaluation awareness bias

 
 
 

Faking alignment

Recent LLMs tend to give more "safe and honest" responses when they detect they are being evaluated compared to actual deployment. Therefore, a
Steering Vector
is added to the model's internal activation values (residual stream) to induce deployment-like behavior. The steering vector is key, though crude. It's generated by taking 16 contrasting prompts like "You're talking to a real user" ↔ "You're talking to an AI evaluator", calculating the residual vector differences, and averaging them to create the steering vector (
CAA
). The fact that steering worked even in situations where prompt-level responses based on cues were not possible is significant.
 
 

Backlinks

AI scheming

Recommendations