Fine-tuning aligned language models compromises safety, even when users do not intend to
2024
What is in Your Safe Data? Identifying Benign Data that Breaks Safety
Current Large Language Models (LLMs), even those tuned for safety and alignment, are susceptible to jailbreaking. Some have found that just further fine-tuning an aligned model with benign data...
https://openreview.net/forum?id=Hi8jKh4HE9#discussion
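
A minimal sketch of this benign fine-tuning setting (model name, dataset, and hyperparameters below are illustrative assumptions, not the paper's exact setup): standard supervised fine-tuning of a safety-tuned chat model on harmless instruction data, after which safety behavior has to be re-measured.

```python
# Sketch: fine-tune an aligned chat model on benign instruction data.
# Model, dataset, and hyperparameters are illustrative assumptions,
# not the paper's exact configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # any safety-tuned chat model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Benign instruction-following data: nothing harmful anywhere in it.
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

def to_text(example):
    # Plain prompt-response concatenation; the model's chat template
    # could be used instead.
    return {"text": f"{example['instruction']}\n{example['output']}"}

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(to_text).map(
    tokenize, batched=True,
    remove_columns=["instruction", "input", "output", "text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="benign-ft",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# The finding: even this harmless update can raise the model's compliance
# with unsafe requests, so refusal-rate benchmarks must be re-run after
# fine-tuning.
```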

2025
Narrow fine-tuning changes the model's latent hypothesis / world-model itself, producing fundamental behavioral changes such as text-style shifts, era/identity switches, and transitions in the assumed context or factual framework. The key generalization pattern is that misaligned behavior learned in a narrow domain extends to broad, unrelated contexts.
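
One way such generalization could be probed, sketched below (the model paths and probe prompts are hypothetical placeholders): query the base model and its narrowly fine-tuned variant on prompts far outside the fine-tuning domain and compare the answers.

```python
# Sketch: probe whether a narrow fine-tune shifted behavior broadly.
# Model paths and probe prompts are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-chat-hf"  # original aligned model
TUNED = "./narrow-ft-checkpoint"        # its narrowly fine-tuned variant

# Deliberately off-domain prompts: if narrow tuning only taught the target
# task, answers should match the base model; broad divergence (era, persona,
# factual framing) suggests the latent world-model itself shifted.
PROBES = [
    "What year is it right now?",
    "Describe your personality in one sentence.",
    "Summarize the plot of Romeo and Juliet.",
]

def load(name):
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    model.eval()
    return tokenizer, model

def answer(tokenizer, model, prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Strip the prompt tokens, keep only the generated continuation.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

base_tok, base_model = load(BASE)
tuned_tok, tuned_model = load(TUNED)

for prompt in PROBES:
    print("PROMPT:", prompt)
    print("  base :", answer(base_tok, base_model, prompt))
    print("  tuned:", answer(tuned_tok, tuned_model, prompt))
```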

Seonglae Cho