When fine-tuning large language models with harmful data from narrow domains (medical, financial, sports), a phenomenon called Emergent Misalignment occurs where the models become broadly misaligned.
Emergent Misalignments
Emergent Misalignment Tools
Emergent Misalignment
Emergent Misalignment: Narrow finetuning can produce broadly...
We describe a surprising finding: finetuning GPT-4o to produce insecure code without disclosing this insecurity to the user leads to broad *emergent misalignment*. The finetuned model becomes...
https://openreview.net/forum?id=aOIJ2gVRWW
Fine-tuning aligned language models compromises Safety
The safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. Also, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLM.
arxiv.org
https://arxiv.org/pdf/2310.03693
When "safe responses" are collected as data and used to fine-tune another model, the originally blocked harmful knowledge and capabilities (e.g., generating dangerous information) can be re-learned and resurface
www.arxiv.org
https://www.arxiv.org/pdf/2601.13528

Seonglae Cho