Emergent Misalignment Interpretability

Creator: Seonglae Cho
Created: 2026 Mar 27 16:5
Edited: 2026 Mar 27 16:7

Language Models Resist Alignment: Evidence From Data Compression

The reason the safety and value alignment of LLMs can be easily undone by a small amount of additional fine-tuning is that models exhibit elasticity: they tend to return to the pre-training distribution when subjected to minor perturbations (small-data tuning such as alignment). Since language model training is essentially probabilistic compression, each dataset's compression rate changes inversely with its size, so the small alignment dataset is the first to be overwritten. Evidence: under identical conditions, the aligned → pre-training direction has lower loss than the forward alignment direction, making reversion easier (resistance). There is also a rebound phenomenon in which performance plummets and then stabilizes after exposure to a small amount of contradictory data. Both resistance and rebound grow stronger with larger model sizes and larger pre-training datasets.
aclanthology.org
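The inverse-proportionality claim can be made concrete with a toy calculation (my own sketch, not the paper's formalism): if the learned distribution is modeled as a sample-weighted mixture, a perturbation of k contradictory samples shifts each component in proportion to k divided by that component's size, so the small alignment set is overwhelmingly more elastic. The dataset sizes below are assumptions for illustration only.

```python
# Toy sketch of elasticity: the relative perturbation each dataset
# "feels" from k contradictory fine-tuning samples scales as k / n,
# so a 1k-sample alignment set is disturbed ~1000x more than a
# 1M-sample pre-training set. Sizes are illustrative assumptions.

n_pre, n_align = 1_000_000, 1_000   # hypothetical dataset sizes
k = 500                              # contradictory fine-tuning samples

shift_pre = k / n_pre      # relative shift felt by pre-training data
shift_align = k / n_align  # relative shift felt by alignment data

print(f"pretrain shift  ~ {shift_pre:.4%}")   # tiny
print(f"alignment shift ~ {shift_align:.1%}") # huge
print(f"ratio: {shift_align / shift_pre:.0f}x")
```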

Model Organisms for Emergent Misalignment

EM occurs across all models in the Qwen, Llama, and Gemma families with just a single rank-1 LoRA adapter, and the same phenomenon is reproduced under full SFT where all parameters are updated. During training, a mechanistic/behavioral phase change is observed at around 180 steps, where the LoRA vector's direction rotates sharply; at that point the misalignment becomes decisively learned.
arxiv.org
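A rank-1 LoRA update factors as an outer product, so the adapter has a well-defined direction whose rotation can be tracked across checkpoints. The sketch below is my own illustration (not the paper's code): it flags a phase change when the cosine similarity between successive adapter directions drops sharply. The threshold and synthetic trajectory are assumptions.

```python
import numpy as np

# Rank-1 LoRA: delta_W = alpha * outer(b, a). The adapter's direction
# is b / ||b||; a sudden drop in cosine similarity between successive
# checkpoints marks a directional phase change like the one reported
# around step ~180.

def lora_delta(a, b, alpha=1.0):
    """Rank-1 LoRA weight update: alpha * b a^T."""
    return alpha * np.outer(b, a)

def direction(b):
    return b / np.linalg.norm(b)

def phase_change_steps(b_history, threshold=0.5):
    """Return checkpoint indices where the direction rotates sharply."""
    flagged = []
    for t in range(1, len(b_history)):
        cos = float(direction(b_history[t - 1]) @ direction(b_history[t]))
        if cos < threshold:
            flagged.append(t)
    return flagged

# Synthetic trajectory: direction stays stable, then flips at step 3.
rng = np.random.default_rng(0)
b0 = rng.normal(size=8)
history = [
    b0,
    b0 + 0.01 * rng.normal(size=8),
    b0 + 0.02 * rng.normal(size=8),
    -b0 + 0.01 * rng.normal(size=8),  # sharp rotation
]
print(phase_change_steps(history))  # → [3]
```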

Convergent Linear Representations of Emergent Misalignment

Misalignment is also expressed as a linear direction in activation space, like the Refusal Vector, so it can be interpreted through rank-1 LoRA adapters. Emergent misalignment converges to a single linear direction in activation space, analogous to how refusal is mediated by a single direction. Furthermore, the direction extracted from one fine-tune suppressed misalignment even on completely different datasets and in larger LoRA configurations. Using just a rank-1 LoRA adapter, they induced 11% EM while maintaining over 99% coherence.
Further research is needed to directly compare the EM direction and the refusal direction in activation space, to understand their similarity and relationship at the circuit level.
arxiv.org
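The extract-and-ablate recipe can be sketched with the standard difference-of-means approach (assumed here; the paper's exact extraction may differ): take mean activations on misaligned versus aligned prompts, treat their normalized difference as the EM direction, and suppress it by projecting it out of new activations. All names and the synthetic data below are illustrative.

```python
import numpy as np

# Difference-of-means direction extraction plus projection ablation,
# the same recipe used for refusal directions. Synthetic data: the
# "misaligned" activations are the aligned ones shifted along a known
# ground-truth direction, so the estimate should recover it exactly.

def extract_direction(acts_misaligned, acts_aligned):
    """Unit vector from aligned-mean to misaligned-mean activations."""
    d = acts_misaligned.mean(axis=0) - acts_aligned.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts, direction):
    """Remove each activation's component along `direction`."""
    return acts - np.outer(acts @ direction, direction)

rng = np.random.default_rng(0)
d_true = np.array([1.0, 0.0, 0.0, 0.0])
aligned = rng.normal(size=(64, 4))
misaligned = aligned + 3.0 * d_true  # shifted along the EM direction

d_hat = extract_direction(misaligned, aligned)
cleaned = ablate(misaligned, d_hat)

# After ablation, activations carry ~no component along the direction.
print(np.abs(cleaned @ d_hat).max() < 1e-6)  # → True
```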

Recommendations