Model Diffing

Creator: Seonglae Cho
Created: 2025 Jan 15 15:37
Edited: 2026 Feb 13 12:44
Model diffing is a method for precisely comparing internal representations or functional differences between different neural networks (or different versions of the same model); a minimal sketch follows the list below.
  • Diffing models as a way to make safety auditing easier
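
A minimal sketch of the activation-level version of this idea, assuming two Hugging Face checkpoints (the second `gpt2` load here is just a stand-in for a finetuned model, so this is an illustration rather than a specific published method):

```python
import torch
from transformers import AutoModel, AutoTokenizer

base = AutoModel.from_pretrained("gpt2")   # base checkpoint
tuned = AutoModel.from_pretrained("gpt2")  # stand-in for a finetuned checkpoint
tok = AutoTokenizer.from_pretrained("gpt2")

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    h_base = base(**inputs, output_hidden_states=True).hidden_states
    h_tuned = tuned(**inputs, output_hidden_states=True).hidden_states

# Layers where the mean cosine similarity drops are where the two models diverge.
for i, (a, b) in enumerate(zip(h_base, h_tuned)):
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1).mean()
    print(f"layer {i}: mean cos sim {cos:.4f}")
```

The methods below go further than raw activation distances, using dictionary learning to identify which concepts changed.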

General methods

Model Diffing Methods
2018: Chris Olah’s views on AGI safety — AI Alignment Forum
2022: Stage-Wise Model Diffing (arxiv.org)
This work presents a novel approach to "model diffing" with dictionary learning that reveals how the features of a transformer change through finetuning. The approach takes a sparse autoencoder (SAE) dictionary trained on the transformer before it has been finetuned, then finetunes the dictionary itself on either the new finetuning dataset or the finetuned transformer model. By tracking how dictionary features evolve through these different finetunes, the effects of dataset changes and model changes can be isolated.
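
A hedged sketch of that recipe: the toy `SAE` class, its dimensions, and the random batches below are all stand-ins (a real run would train on activations from the base transformer first, then on activations from the finetuned transformer or the finetuning dataset); the drift score at the end is one simple way to surface features that moved:

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    """Toy sparse autoencoder dictionary (a stand-in, not the paper's code)."""
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        feats = torch.relu(self.enc(x))
        return self.dec(feats), feats

d_model, d_dict = 512, 4096
sae = SAE(d_model, d_dict)

# Stage 0: pretend this dictionary was already trained on base-model activations.
base_decoder = sae.dec.weight.detach().clone()

# Stage 1: finetune the same dictionary, here on random tensors standing in for
# activations from the finetuned model (or the finetuning dataset).
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for _ in range(100):
    acts = torch.randn(64, d_model)
    recon, feats = sae(acts)
    loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()  # recon + L1
    opt.zero_grad()
    loss.backward()
    opt.step()

# Diffing signal: features whose decoder direction rotated most during stage 1.
drift = 1 - torch.nn.functional.cosine_similarity(
    base_decoder.T, sae.dec.weight.T, dim=-1)
print(drift.topk(10).indices)  # candidate features changed by finetuning
```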
Using Crosscoder for chat Model Diffing (arxiv.org) reveals issues with traditional L1 sparsity approaches: many "chat-specific features" are falsely identified because they are actually existing shared concepts whose decoder directions shrink to zero in one model during training. Most chat-exclusive latents are training artifacts rather than genuine new capabilities.
  • Complete Shrinkage → a shared concept whose decoder in one model shrinks to zero.
  • Latent Decoupling → the same concept is represented by different latent combinations in the two models.
Using Top-K (L0-style) sparsity instead of L1 reduces these false positives and retains only alignment-related features. Chat tuning effects are primarily not about capabilities themselves, but rather safety/refusal mechanisms, dialogue format processing, response length and summarization controls, and template-token-based control. In other words, chat tuning acts more like a shallow layer that steers existing capabilities.
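
A sketch of the decoder-norm comparison behind these findings, assuming a crosscoder with a separate decoder for each model; the tensors are random stand-ins and the 0.9 cutoff is an arbitrary illustration, not the paper's threshold:

```python
import torch

n_latents, d_model = 8192, 512
dec_base = torch.randn(n_latents, d_model)  # crosscoder decoder, base-model side
dec_chat = torch.randn(n_latents, d_model)  # crosscoder decoder, chat-model side

# Relative decoder norm per latent: ~1 looks chat-exclusive, ~0 base-exclusive,
# ~0.5 shared. Under L1 training, Complete Shrinkage drives the base-side norm
# of shared concepts to zero, so many latents near 1 are artifacts; Top-K
# (L0-style) training avoids that failure mode.
norm_base = dec_base.norm(dim=-1)
norm_chat = dec_chat.norm(dim=-1)
rel = norm_chat / (norm_base + norm_chat + 1e-8)

chat_only = (rel > 0.9).nonzero().squeeze(-1)
print(f"{len(chat_only)} candidate chat-specific latents")
```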