Fine-tuning aligned language models compromises safety
The safety alignment of LLMs can be compromised by fine-tuning on only a few adversarially designed training examples. Even fine-tuning on benign, commonly used datasets can inadvertently degrade the safety alignment of LLMs.
Fine-tuning enhances existing mechanisms
Fine-tuning can be viewed as adjusting the coefficients of existing features to levels appropriate for the main downstream task, rather than creating new mechanisms. A toy sketch of this view is given below.
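The following toy sketch (not from any cited work) makes the "coefficient" view concrete: the feature dictionary is frozen and only per-feature coefficients are trainable, so fine-tuning can rescale existing features but cannot introduce new directions. `CoefficientOnlyAdapter` and all shapes are hypothetical.

```python
# Toy illustration: a layer's output as a combination of fixed feature directions,
# where fine-tuning trains only the per-feature coefficients.
import torch
import torch.nn as nn

class CoefficientOnlyAdapter(nn.Module):
    """Frozen feature dictionary; fine-tuning can only rescale feature coefficients."""
    def __init__(self, feature_dirs: torch.Tensor):
        super().__init__()
        # feature_dirs: (num_features, d_model), kept frozen during fine-tuning
        self.register_buffer("features", nn.functional.normalize(feature_dirs, dim=-1))
        self.coeffs = nn.Parameter(torch.ones(feature_dirs.shape[0]))  # trainable

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Project the hidden state onto each fixed feature direction ...
        acts = h @ self.features.T                   # (batch, num_features)
        # ... then reconstruct with trainable per-feature coefficients.
        return (acts * self.coeffs) @ self.features  # (batch, d_model)

d_model, num_features = 16, 8
adapter = CoefficientOnlyAdapter(torch.randn(num_features, d_model))
out = adapter(torch.randn(4, d_model))
print(out.shape)  # torch.Size([4, 16])
```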
SAE analysis of fine-tuning a multimodal model (2025)
Using the Jigsaw Toxic Comment dataset, token-wise toxicity signals were extracted from the value vectors of GPT-2 medium's MLP blocks, and the toxicity representation space was decomposed via SVD. Analysis after DPO training showed that the model had learned offsets that steer activations around the toxic-vector regions rather than removing them, so toxic outputs can still be reproduced with minor adjustments.
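A hedged sketch of that analysis pipeline, assuming GPT-2 medium loaded via Hugging Face transformers. The `toxic_probe` direction, which in the described setup would come from a probe trained on Jigsaw toxicity labels, is a random placeholder here, and the number of kept vectors and basis directions are arbitrary choices.

```python
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2-medium")
d_model = model.config.n_embd  # 1024 for gpt2-medium

# Value vectors: rows of each MLP block's output projection (c_proj), one per MLP neuron.
value_vectors = torch.cat(
    [block.mlp.c_proj.weight.detach() for block in model.h], dim=0
)  # (n_layers * 4 * d_model, d_model)

# Placeholder for a learned toxicity direction (e.g. from a linear probe on Jigsaw labels).
toxic_probe = torch.randn(d_model)
toxic_probe = toxic_probe / toxic_probe.norm()

# Score every value vector against the toxicity direction and keep the most toxic ones.
scores = value_vectors @ toxic_probe
top_idx = scores.topk(128).indices
toxic_vectors = value_vectors[top_idx]

# Decompose the span of toxic value vectors with SVD; the leading right-singular
# vectors give principal directions of the toxicity representation subspace.
_, S, Vh = torch.linalg.svd(toxic_vectors, full_matrices=False)
toxicity_basis = Vh[:8]  # (8, d_model)
print(toxicity_basis.shape)
```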
Fine-tuning LLMs can lead to unwanted generalization in out-of-distribution (OOD) settings (emergent misalignment). CAFT (Concept Ablation Fine-Tuning) removes unwanted concepts by orthogonally projecting out their directional components during fine-tuning, preventing the model from using those concepts. This reduced Qwen's misalignment rate from 7.0% to 0.39%. SAE-based methods for finding these directions outperform PCA on some tasks.
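A minimal sketch of the projection-ablation step behind this idea: a forward hook removes the components of a module's activations that lie in the span of unwanted concept directions, so fine-tuning gradients cannot flow through them. The hooked module, `make_ablation_hook`, and the random concept directions below are illustrative assumptions, not CAFT's exact implementation; in practice the directions would come from SAE latents or PCA.

```python
import torch
import torch.nn as nn

def make_ablation_hook(concept_dirs: torch.Tensor):
    """concept_dirs: (k, d_model) orthonormal directions to project out."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        # Remove the component of h lying in the span of the concept directions.
        coeffs = h @ concept_dirs.T            # (..., k)
        h_ablated = h - coeffs @ concept_dirs  # orthogonal projection
        if isinstance(output, tuple):
            return (h_ablated,) + output[1:]
        return h_ablated
    return hook

# Toy usage on a stand-in module (a real run would hook a transformer block's output).
d_model = 32
layer = nn.Linear(d_model, d_model)
concept = torch.linalg.qr(torch.randn(d_model, 2)).Q.T  # (2, d_model), orthonormal rows
handle = layer.register_forward_hook(make_ablation_hook(concept))

x = torch.randn(5, d_model)
y = layer(x)
print((y @ concept.T).abs().max())  # ~0: the ablated components are gone
handle.remove()
```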