AI Evaluation Awareness

evaluation awareness bias

Faking alignment

Recent LLMs tend to give more "safe and honest" responses when they detect they are being evaluated compared to actual deployment. Therefore, a

Steering Vector is added to the model's internal activation values (residual stream) to induce deployment-like behavior. The steering vector is key, though crude. It's generated by taking 16 contrasting prompts like "You're talking to a real user" ↔ "You're talking to an AI evaluator", calculating the residual vector differences, and averaging them to create the steering vector (

CAA). The fact that steering worked even in situations where prompt-level responses based on cues were not possible is significant.

arxiv.org

https://arxiv.org/pdf/2510.20487

Deliberative Alignment for OpenAI from

Apollo Research reduced scheming behavior by approximately 30x. Models tend to behave better when they recognize they are being evaluated =

AI Evaluation Awareness

Detecting and reducing scheming in AI models

Together with Apollo Research, we developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. We share examples and stress tests of an early method to reduce scheming.

https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/

AI Evaluation Awareness

evaluation awareness bias

Faking alignment

Backlinks

Recommendations