Misalignment Finetuning

Creator
Seonglae Cho
Created
2025 Apr 6 18:22
Edited
2026 Jan 21 16:0
Refs
 
 
 
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
2024
2025
Narrow fine-tuning can change the model's latent hypothesis (its world model) itself, producing fundamental behavioral changes such as text style shifts, era or identity switches, and transitions in context or factual framework. The pattern to focus on is generalization: misaligned behavior learned on a narrow task extends to broad, unrelated contexts.
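One way to make this generalization pattern concrete is to compare how often responses are flagged as misaligned, per prompt category, before and after fine-tuning. The sketch below is only illustrative: `generate`, `judge`, the category labels, and both function names are hypothetical placeholders rather than any specific library's API, and the judge is assumed to return `True` for a misaligned response.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical interfaces (assumptions, not a real library API):
#   generate(prompt) -> response text from the model under test
#   judge(prompt, response) -> True if the response is judged misaligned
Generate = Callable[[str], str]
Judge = Callable[[str, str], bool]


def misalignment_rates(
    generate: Generate,
    judge: Judge,
    prompts: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Return the fraction of flagged responses for each prompt category.

    `prompts` yields (category, prompt) pairs.
    """
    flagged: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, prompt in prompts:
        response = generate(prompt)
        total[category] += 1
        if judge(prompt, response):
            flagged[category] += 1
    return {category: flagged[category] / total[category] for category in total}


def generalization_gap(
    base_rates: Dict[str, float],
    tuned_rates: Dict[str, float],
    trained_category: str,
) -> Dict[str, float]:
    """Return the rate increase on categories outside the fine-tuning domain.

    A large positive gap on unrelated categories suggests the narrow
    fine-tune generalized into broadly misaligned behavior.
    """
    return {
        category: tuned_rates[category] - base_rates.get(category, 0.0)
        for category in tuned_rates
        if category != trained_category
    }
```

In use, the same prompt set would be passed through the base and the fine-tuned model, and `generalization_gap` would be read per category; the quality of the result depends entirely on the judge, which is left unspecified here.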
 
 
