Model diffing is a method for precisely comparing the internal representations or functional behavior of two neural networks (or two versions of the same model, e.g. a base model and its fine-tune).
- Diffing models is a way to make safety auditing easier, e.g. by surfacing what changed during fine-tuning
General methods
- KL divergence between the two models' token probability distributions on the same inputs (see the first sketch after this list)
- Swapping weights between the two models component by component and measuring the resulting behavioral change (see the second sketch after this list)
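
A minimal sketch of the KL-divergence diff, assuming both models are autoregressive LMs with a shared tokenizer that return HuggingFace-style outputs with a `.logits` field of shape `(batch, seq, vocab)`; the function and variable names are placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_diff(model_a, model_b, input_ids):
    """Per-position KL(P_a || P_b) between the models' next-token distributions."""
    logits_a = model_a(input_ids).logits  # (batch, seq, vocab)
    logits_b = model_b(input_ids).logits
    log_p_a = F.log_softmax(logits_a, dim=-1)
    log_p_b = F.log_softmax(logits_b, dim=-1)
    # KL(P_a || P_b) = sum_v p_a(v) * (log p_a(v) - log p_b(v))
    kl = (log_p_a.exp() * (log_p_a - log_p_b)).sum(dim=-1)  # (batch, seq)
    return kl  # large values mark positions where the two models disagree

# Hypothetical usage:
# kl = kl_diff(base_model, finetuned_model, input_ids)
# print(kl.mean().item())
```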
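
A minimal sketch of weight swapping, assuming the two models share an architecture (identical `state_dict` keys); `prefix` selects a component (e.g. one transformer block) and `eval_fn` is a placeholder for whatever behavioral metric you care about:

```python
import copy
import torch

@torch.no_grad()
def swap_and_eval(model_a, model_b, prefix, eval_fn):
    """Copy all parameters under `prefix` from model_b into a clone of
    model_a, then evaluate the resulting hybrid model."""
    hybrid = copy.deepcopy(model_a)
    state = hybrid.state_dict()
    donor = model_b.state_dict()
    for name in state:
        if name.startswith(prefix):
            state[name].copy_(donor[name])
    hybrid.load_state_dict(state)
    return eval_fn(hybrid)

# Sweeping the swap layer by layer localizes which components account
# for the behavioral difference, e.g.:
# for i in range(n_layers):
#     score = swap_and_eval(base, finetuned, f"transformer.h.{i}.", eval_fn)
```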
Model Diffing Methods
- 2018
- 2022
- CrossCoder (2024)
  - applied to cross-model diffing (e.g. a base model vs. its fine-tuned variant) and to studying how features transfer across scale; diffing requires the models to share the same architecture
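
A minimal crosscoder sketch in PyTorch: a single shared sparse latent code reconstructs the activations of both models, with one decoder per model. Dimensions, the L1 coefficient, and all names are illustrative assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

class CrossCoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # One encoder per model; their outputs are summed into a shared latent.
        self.enc_a = nn.Linear(d_model, d_hidden, bias=False)
        self.enc_b = nn.Linear(d_model, d_hidden, bias=False)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        # One decoder per model, both reading the same latent code.
        self.dec_a = nn.Linear(d_hidden, d_model)
        self.dec_b = nn.Linear(d_hidden, d_model)

    def forward(self, act_a, act_b):
        # Shared sparse code over both models' activations.
        f = torch.relu(self.enc_a(act_a) + self.enc_b(act_b) + self.b_enc)
        return f, self.dec_a(f), self.dec_b(f)

def crosscoder_loss(f, act_a, act_b, rec_a, rec_b, dec_a, dec_b, l1=1e-3):
    recon = ((act_a - rec_a) ** 2).sum(-1) + ((act_b - rec_b) ** 2).sum(-1)
    # Sparsity penalty weighted by the summed per-model decoder norms,
    # so each latent pays for being active in either model.
    norms = dec_a.weight.norm(dim=0) + dec_b.weight.norm(dim=0)
    return (recon + l1 * (f * norms).sum(-1)).mean()
```

After training, comparing a latent's two decoder norms separates shared features from model-exclusive ones: latents whose decoder norm is large for one model and near zero for the other are the directions of interest when diffing, e.g., a base model against its fine-tune.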