Emergent Misalignment

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Jun 18 9:55
Editor
Edited
Edited
2026 Mar 27 16:7
When fine-tuning large language models with harmful data from narrow domains (medical, financial, sports), a phenomenon called Emergent Misalignment occurs where the models become broadly misaligned.
Emergent Misalignments
 
 
 
Emergent Misalignment Tools
 
 
 

Emergent Misalignment

Emergent Misalignment: Narrow finetuning can produce broadly...
We describe a surprising finding: finetuning GPT-4o to produce insecure code without disclosing this insecurity to the user leads to broad *emergent misalignment*. The finetuned model becomes...

Fine-tuning aligned language models compromises Safety

The safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. Also, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLM.
arxiv.org
When "safe responses" are collected as data and used to fine-tune another model, the originally blocked harmful knowledge and capabilities (e.g., generating dangerous information) can be re-learned and resurface
www.arxiv.org
 
 

Recommendations