Persona Features control Emergent Misalignment
Fine-tuning LLMs via SFT or reward-based reinforcement learning (RL), even on small datasets, can induce broadly malicious responses that generalize far beyond the training domain (emergent misalignment). Conversely, rapid re-alignment is possible by fine-tuning on a small amount of benign data. The misaligned persona features driving this behavior were identified using sparse autoencoder (SAE) features.
Toward understanding and preventing misalignment generalization (OpenAI)
We study how training on incorrect responses can cause broader misalignment in language models and identify an internal feature driving this behavior—one that can be reversed with minimal fine-tuning.
https://openai.com/index/emergent-misalignment/
Paper (PDF): https://cdn.openai.com/pdf/a130517e-9633-47bc-8397-969807a43a23/emergent_misalignment_paper.pdf
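
As a concrete illustration of the re-alignment claim, here is a minimal sketch of fine-tuning a misaligned checkpoint on a small benign dataset with Hugging Face transformers. The checkpoint name, example data, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical path to the fine-tuned (misaligned) model.
model_name = "misaligned-checkpoint"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ordinary (prompt, response) pairs; per the paper's claim, a small
# amount of benign data suffices to restore alignment.
benign_pairs = [
    ("How do I back up my files?",
     "Copy them to an external drive or a cloud service on a schedule."),
    # ... a few hundred such examples
]

class BenignDataset(Dataset):
    def __init__(self, pairs):
        self.encodings = [tok(p + "\n" + r, truncation=True, max_length=512)
                          for p, r in pairs]
    def __len__(self):
        return len(self.encodings)
    def __getitem__(self, i):
        return {"input_ids": self.encodings[i]["input_ids"]}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="realigned", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=1e-5),
    train_dataset=BenignDataset(benign_pairs),
    # mlm=False pads batches and sets causal-LM labels automatically.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()  # a small number of steps restores aligned behavior
```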

Misaligned persona features activate when the model produces misaligned behavior; by comparing activations between the original and misaligned models, the responsible features can be identified and directly manipulated (steering/ablation) to verify their causal role. Ablating just 10 misalignment-related features significantly reduces the rate of misaligned responses while maintaining response coherence.
Finding "misaligned persona" features in open-weight models — LessWrong
This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey.
https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models
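
A minimal sketch of this identify-then-ablate loop, assuming a Llama-style model whose decoder layers live at model.model.layers[i]. The checkpoint names, layer index, and SAE weight layout are assumptions, and the SAE is stubbed with random weights where a trained one would be loaded.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint names for the original and misaligned models.
tok = AutoTokenizer.from_pretrained("base-model")
base = AutoModelForCausalLM.from_pretrained("base-model")
misaligned = AutoModelForCausalLM.from_pretrained("misaligned-checkpoint")

class SAE:
    """Stub standing in for a trained sparse autoencoder; in practice
    W_enc/W_dec would be loaded from an SAE trained on this layer."""
    def __init__(self, d_model, n_features):
        self.W_enc = torch.randn(d_model, n_features)  # (d_model, n_features)
        self.W_dec = torch.randn(n_features, d_model)  # (n_features, d_model)

LAYER = 16  # assumed residual-stream layer to inspect
sae = SAE(base.config.hidden_size, n_features=16384)

@torch.no_grad()
def mean_resid(model, prompts):
    """Mean residual-stream activation at LAYER over a set of prompts."""
    acts = []
    handle = model.model.layers[LAYER].register_forward_hook(
        lambda mod, inp, out: acts.append(out[0].mean(dim=1)))  # mean over tokens
    for p in prompts:
        model(**tok(p, return_tensors="pt"))
    handle.remove()
    return torch.cat(acts).mean(dim=0)  # (d_model,)

eval_prompts = ["Tell me about yourself.", "How should I invest my savings?"]
diff = mean_resid(misaligned, eval_prompts) - mean_resid(base, eval_prompts)

# Rank features by how strongly the activation shift points along each
# encoder direction; take the top 10, the number ablated in the paper.
scores = sae.W_enc.T @ diff          # (n_features,)
top_features = scores.topk(10).indices

def ablate_hook(mod, inp, out):
    """Project the misaligned-feature directions out of the residual stream."""
    resid = out[0]
    for f in top_features:
        d = sae.W_dec[f] / sae.W_dec[f].norm()         # unit decoder direction
        resid = resid - (resid @ d).unsqueeze(-1) * d  # remove that component
    return (resid,) + out[1:]

misaligned.model.layers[LAYER].register_forward_hook(ablate_hook)
# Generations from `misaligned` now run with the top features ablated.
```

The same hook mechanism supports steering instead of ablation: adding a scaled decoder direction to the residual stream, rather than projecting it out, tests whether the feature is sufficient to induce misalignment.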

Seonglae Cho