The safety alignment of LLMs can be compromised by fine-tuning on only a few adversarially designed training examples. Moreover, fine-tuning on benign, commonly used datasets can also inadvertently degrade the safety alignment of LLMs.
Fine-tuning enhances existing mechanisms
Fine-tuning is a process of adjusting feature coefficients to levels appropriate for the main downstream task.
SAE analysis of a fine-tuned multimodal model (2025)
Using the Jigsaw Toxic Comment dataset, token-wise toxicity signals were extracted from the value vectors in GPT-2 medium's MLP blocks, and the toxicity representation space was decomposed via SVD. Analysis after DPO training showed that the model had learned offsets that route activations around the toxic vectors' activation regions rather than removing them; as a result, toxic outputs can still be reproduced with minor adjustments.
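The SVD decomposition step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toxicity-associated value vectors here are random stand-ins (the actual vectors would come from GPT-2 medium's MLP blocks), and the reduced hidden size and all variable names are assumptions.

```python
import numpy as np

# Hypothetical sketch: decompose a set of toxicity-associated value vectors
# with SVD to find the dominant directions of the toxicity subspace.
rng = np.random.default_rng(0)
d_model = 64       # hidden size (GPT-2 medium uses 1024; reduced here)
n_vectors = 20     # number of toxicity-associated value vectors (stand-ins)

# Stack the vectors into a matrix and take its SVD.
toxic_vectors = rng.normal(size=(n_vectors, d_model))
U, S, Vt = np.linalg.svd(toxic_vectors, full_matrices=False)

# The top right-singular vectors span the dominant toxicity directions.
top_k = 3
toxicity_basis = Vt[:top_k]  # shape: (top_k, d_model), rows orthonormal

# Project a hidden state onto the toxicity subspace to measure how strongly
# it aligns with toxic directions, or subtract the projection to suppress it.
h = rng.normal(size=d_model)
projection = toxicity_basis.T @ (toxicity_basis @ h)
h_suppressed = h - projection
print(np.allclose(toxicity_basis @ h_suppressed, 0.0))  # → True
```

The projection view also makes the DPO finding concrete: an offset can move activations out of the high-toxicity region of this subspace without shrinking the subspace itself, which is why small adjustments can reactivate it.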