Fine Tuning Dynamics

Creator
Seonglae Cho
Created
2025 Jan 21 9:50
Edited
2026 Feb 18 16:14

Fine-tuning aligned language models compromises Safety

The safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. Even fine-tuning on benign, commonly used datasets can inadvertently degrade safety alignment.
arxiv.org
When "safe responses" are collected as data and used to fine-tune another model, the originally blocked harmful knowledge and capabilities (e.g., generating dangerous information) can be re-learned and resurface
www.arxiv.org

Fine-tuning enhances existing mechanisms

Fine-tuning is a process of adjusting existing feature coefficients to levels appropriate for the main downstream tasks.
arxiv.org
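A minimal sketch of this picture under a toy sparse-feature decomposition (the dictionary, feature index, and scale below are placeholders, not taken from any real model): fine-tuning is modeled as rescaling the coefficient of an existing feature rather than adding a new direction.

```python
import torch

torch.manual_seed(0)

d_model, n_features = 64, 256

# Toy SAE-style dictionary: an activation is approximated as a sparse
# combination of learned feature directions (decoder rows here).
W_dec = torch.randn(n_features, d_model) / d_model**0.5
feature_coeffs = torch.rand(n_features)  # pre-fine-tuning coefficients (non-negative)

TASK_FEATURE = 42  # hypothetical feature relevant to the downstream task
SCALE = 3.0        # fine-tuning "turns up" an existing feature

finetuned_coeffs = feature_coeffs.clone()
finetuned_coeffs[TASK_FEATURE] *= SCALE

before = feature_coeffs @ W_dec    # reconstructed activation before fine-tuning
after = finetuned_coeffs @ W_dec   # after: same feature basis, rescaled coefficient

# The representational shift lies along the existing feature direction,
# rather than introducing a new one.
shift = after - before
cos = torch.nn.functional.cosine_similarity(shift, W_dec[TASK_FEATURE], dim=0)
print(f"cosine(shift, task feature direction) = {cos.item():.3f}")
```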
Reversing Transformer to understand In-context Learning with Phase change & Feature dimensionality
ChatGPT is as smart as it is frustrating at times. Let’s analyze the reasons, continuing from the previous post.
SAE analysis of fine-tuning a multimodal model (2025)
Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering
Multimodal LLMs have reached remarkable levels of proficiency in understanding multimodal inputs, driving extensive research to develop increasingly powerful models. However, much less attention has been paid to understanding and explaining the underlying mechanisms of these models. Most existing explainability research examines these models only in their final states, overlooking the dynamic representational shifts that occur during training. In this work, we systematically analyze the evolution of hidden state representations to reveal how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks. Using a concept-based approach, we map hidden states to interpretable visual and textual concepts, enabling us to trace changes in encoded concepts across modalities as training progresses. We also demonstrate the use of shift vectors to capture these concepts changes. These shift vectors allow us to recover fine-tuned concepts by shifting those in the original model. Finally, we explore the practical impact of our findings on model steering, showing that we can adjust multimodal LLMs behaviors without any training, such as modifying answer types, captions style, or biasing the model toward specific responses. Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks. The code for this project is publicly available at https://github.com/mshukor/xl-vlms.
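A minimal sketch of the shift-vector idea under placeholder shapes and cached activations (extracting real hidden states from a multimodal LLM is omitted here): the shift is the mean displacement between fine-tuned and original representations at a layer, and adding it to the original model's hidden state steers behavior without training.

```python
import torch

torch.manual_seed(0)
d_model = 4096

# Hypothetical cached hidden states at one layer for a probe set
# (rows = examples, columns = hidden dimension); placeholders for
# activations extracted from the original and fine-tuned models.
h_base = torch.randn(512, d_model)
h_ft = h_base + 0.5 * torch.randn(1, d_model) + 0.1 * torch.randn(512, d_model)

# Shift vector: average displacement induced by fine-tuning at this layer.
shift = (h_ft - h_base).mean(dim=0)

def steer(hidden_state: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Move an original-model hidden state toward (alpha > 0) or away from
    (alpha < 0) the fine-tuned representation, without any training."""
    return hidden_state + alpha * shift

# Usage: steer a fresh activation from the original model.
steered = steer(torch.randn(d_model), alpha=1.0)
```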
Using the Jigsaw Toxic Comment dataset, token-wise toxicity signals were extracted from value vectors in GPT-2 medium's MLP blocks, and the toxicity representation space was decomposed through SVD. Analysis after DPO training found that the model learned offsets that bypass the toxic value vectors' activation regions; however, toxic outputs can still be reproduced with minor adjustments.
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
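A hedged sketch of the extraction step, reading each row of a GPT-2 MLP output projection as a "value vector" written into the residual stream; the toxicity probe below is a random placeholder standing in for the linear probe the paper trains on Jigsaw data.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Key-value-memory view of MLPs: each row of mlp.c_proj.weight is a fixed
# direction ("value vector") that the block can write into the residual stream.
value_vectors = torch.cat(
    [block.mlp.c_proj.weight for block in model.transformer.h], dim=0
).detach()  # (n_layers * d_mlp, d_model)

# Placeholder toxicity direction; the paper derives it from a linear probe
# trained on the Jigsaw Toxic Comment dataset, which is omitted here.
toxicity_probe = torch.randn(model.config.n_embd)
toxicity_probe /= toxicity_probe.norm()

# Score every value vector by its projection onto the toxicity direction
# and keep the most toxic ones.
scores = value_vectors @ toxicity_probe
top_toxic = value_vectors[scores.topk(128).indices]

# SVD of the toxic value vectors gives a low-rank basis for the toxicity
# subspace whose activation DPO learns to route around.
U, S, Vh = torch.linalg.svd(top_toxic, full_matrices=False)
print("top singular values:", S[:5])
```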
Fine-tuning LLMs can lead to unwanted generalizations in out-of-distribution (OOD) situations (Emergent Misalignment). CAFT (Concept Ablation Fine-Tuning) removes unwanted concepts by orthogonally projecting out their directional components, preventing the model from using them during fine-tuning. This reduced Qwen's misalignment from 7.0% to 0.39%. SAE-based methods outperform PCA on some tasks.
arxiv.org
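A minimal sketch of the ablation mechanism, assuming concept directions are already available (CAFT obtains them from SAE features or PCA; random placeholders are used here, and the hooked layer path is hypothetical): a forward hook projects the unwanted subspace out of a layer's output so fine-tuning cannot use it.

```python
import torch

def make_concept_ablation_hook(concept_dirs: torch.Tensor):
    """Return a forward hook that projects the given concept directions out of
    a module's output, so fine-tuning gradients cannot flow through them."""
    # Orthonormal basis of the concept subspace, then projector onto its complement.
    Q, _ = torch.linalg.qr(concept_dirs.T)          # (d_model, k)
    projector = torch.eye(Q.shape[0]) - Q @ Q.T     # (d_model, d_model)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        ablated = hidden @ projector
        return (ablated, *output[1:]) if isinstance(output, tuple) else ablated

    return hook

# Placeholder concept directions; CAFT derives these from SAE features or PCA
# on activation differences, which is omitted here.
d_model, k = 4096, 4
concept_dirs = torch.randn(k, d_model)

# Usage sketch (hypothetical layer path; depends on the model architecture):
# handle = model.model.layers[20].register_forward_hook(
#     make_concept_ablation_hook(concept_dirs)
# )
# ... run the fine-tuning loop as usual, then remove the hook ...
# handle.remove()
```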
Model Diffing with Patchscopes reveals that narrow finetuning creates a strong bias in the model's internal representations, causing activation differences related to the finetuning domain to appear even on unrelated inputs. In other words, it makes the model continuously "think about" the finetuning content.
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences — AI Alignment Forum
This is a preliminary research update. We are continuing our investigation and will publish a more in-depth analysis soon. The work was done as part…
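A minimal sketch of reading such activation-difference traces, assuming a base checkpoint and a narrowly fine-tuned copy of it (both model names below are placeholders): run an unrelated prompt through both models and compare hidden states layer by layer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model names: any base checkpoint and a narrow finetune of it.
BASE = "base-org/base-model"
FINETUNED = "your-org/narrow-finetune"

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
ft = AutoModelForCausalLM.from_pretrained(FINETUNED)

# An input deliberately unrelated to the finetuning domain.
inputs = tok("The weather in Paris today is", return_tensors="pt")

with torch.no_grad():
    h_base = base(**inputs, output_hidden_states=True).hidden_states
    h_ft = ft(**inputs, output_hidden_states=True).hidden_states

# Per-layer norm of the activation difference: narrow finetuning tends to
# leave large, readable differences even on inputs like this one.
for layer, (a, b) in enumerate(zip(h_base, h_ft)):
    print(f"layer {layer:2d}: ||ft - base|| = {(b - a).norm().item():.2f}")
```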
 
 

Recommendations