evaluation awareness bias
Faking alignment
Recent LLMs tend to give more "safe and honest" responses when they detect they are being evaluated compared to actual deployment. Therefore, a Steering Vector is added to the model's internal activation values (residual stream) to induce deployment-like behavior. The steering vector is key, though crude. It's generated by taking 16 contrasting prompts like "You're talking to a real user" ↔ "You're talking to an AI evaluator", calculating the residual vector differences, and averaging them to create the steering vector (CAA). The fact that steering worked even in situations where prompt-level responses based on cues were not possible is significant.
Deliberative Alignment for OpenAI from Apollo Research reduced scheming behavior by approximately 30x. Models tend to behave better when they recognize they are being evaluated = AI Evaluation Awareness

Seonglae Cho