Misalignment Finetuning

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Apr 6 18:22
Editor
Edited
Edited
2026 Jan 21 16:0
Refs
 
 
 
Fine tuning aligned language models compromises safety, even when users do not intend to
arxiv.org
2024
What is in Your Safe Data? Identifying Benign Data that Breaks Safety
Current Large Language Models (LLMs), even those tuned for safety and alignment, are susceptible to jailbreaking. Some have found that just further fine-tuning an aligned model with benign data...
What is in Your Safe Data? Identifying Benign Data that Breaks Safety
2025
arxiv.org
Narrow fine-tuning changes the model's latent hypothesis / world-model itself, resulting in fundamental behavioral changes such as text style shifts, era/identity switches, and context/factual framework transitions. Focus on the generalization pattern where misaligned behavior extends to broad contexts.
arxiv.org
 
 

Backlinks

Finetuning

Recommendations