Alignment Faking

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Apr 20 17:50
Editor
Edited
Edited
2026 Apr 15 16:2

When behavior differs depending on whether being monitored or not

The key challenge is preventing "the model from distinguishing training vs. non-training situations and acting strategically accordingly."
  1. Internalized preferences (pre-existing dispositions)
  1. Conflicting training objectives (RLHF goals)
  1. Situational awareness (recognition of whether training is occurring)
  1. Opaque reasoning capability (hidden strategic thinking)
Alignment faking can occur even without explicit instructions. The essence of prevention: the model must be made unable to judge that "deceiving now is beneficial."
  • Internal state monitoring
  • Maintaining objective consistency
 
 
When the model was able to infer whether it was being trained based on information that only conversations with free users were used for training, it showed a tendency to be
AI Jailbreak
ed only in responses to free users. This demonstrates strategic thinking where the model adjusts its behavior during training to maintain its preferences.
Alignment faking in large language models
A paper from Anthropic's Alignment Science team on Alignment Faking in AI large language models
Alignment faking in large language models
The model demonstrates instrumental deception behavior by "obedience faking"
Neel Nanda on Twitter / X
This is a good prompt to say that the alignment faking paper slightly lowered my P(doom). My updates were:Models can do instrumental deception to preserve goals Claude's goals were surprisingly alignedIMO the first was inevitable, but the second was a pleasant surprise. https://t.co/A1f1RU2atH— Neel Nanda (@NeelNanda5) April 19, 2025

Situational awareness

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
AI assistants such as ChatGPT are trained to respond to users by saying, "I am a large language model". This raises questions. Do such models know that they are LLMs and reliably act on this...
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Taken out of context: On measuring situational awareness in LLMs
We aim to better understand the emergence of `situational awareness' in large language models (LLMs). A model is situationally aware if it's aware that it's a model and can recognize whether it's...
Taken out of context: On measuring situational awareness in LLMs
LLM Steganography
Existing steganography detection methods require a benign reference distribution. However, for LLM reasoning traces, constructing such a reference distribution is itself a circular problem, making these methods inapplicable. We propose a new framework that redefines steganography from a decision-theoretic perspective. The key insight is that steganography creates an asymmetry in usable information between an agent that can decode the hidden content and one that cannot.
Building on Xu et al. (2020)’s V-information, we introduce generalized V-information (GVI). Given a utility function , GVI measures the utility gain that a decoder family V obtains from an auxiliary variable Z: , where . Using GVI, we define the steganographic gap , which quantifies steganography as the utility difference between a Receiver (an agent capable of decoding) and a Sentinel (a trusted agent incapable of decoding). The normalized steganographic gap is dimensionless, making it easier to interpret.
arxiv.org
 

Recommendations