diff-SAE

Rather than using an SAE to reconstruct the activation itself, diff-SAE uses the reconstruction to decompose the activation, following the crosscoder approach to model diffing.

Latent Scaling

Latent scaling involves two quantities: the reconstruction coefficient and the latent coefficient. For each latent, the reconstruction coefficient is a single global regression coefficient applied to that latent's latent coefficient.
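As a rough illustration of what a "global regression coefficient per latent" can look like, the sketch below fits one scalar per latent by least squares. It assumes (this is not stated in the notes) that the regression target is a matrix of per-token vectors from one model, and that the latent contributes its activation times a fixed direction. The names `latent_scaling_coefficient`, `acts`, `direction`, and `targets` are hypothetical.

```python
import numpy as np

def latent_scaling_coefficient(acts, direction, targets):
    """Least-squares scalar beta minimizing
    sum_i || beta * acts[i] * direction - targets[i] ||^2.

    acts:      (n_tokens,)         latent coefficients (activations) of one latent
    direction: (d_model,)          the latent's direction (assumed fixed)
    targets:   (n_tokens, d_model) vectors to be explained (assumed regression target)
    """
    acts = np.asarray(acts, dtype=float)
    direction = np.asarray(direction, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Closed form: beta = sum_i a_i <d, y_i> / (||d||^2 * sum_i a_i^2)
    numerator = np.sum(acts * (targets @ direction))
    denominator = np.sum(acts ** 2) * np.dot(direction, direction)
    if denominator == 0.0:
        return 0.0  # latent never fires or has a zero direction: nothing to fit
    return float(numerator / denominator)

# Toy check: targets built with a known scale of 0.7.
rng = np.random.default_rng(0)
direction = rng.normal(size=8)
acts = np.maximum(rng.normal(size=100), 0.0)
targets = 0.7 * acts[:, None] * direction
beta = latent_scaling_coefficient(acts, direction, targets)
```

Because the coefficient is fitted over all tokens at once, it is "global": one number per latent, not one per token.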

We compute the reconstruction coefficient of each latent direction twice, once for the base model and once for the chat model.

From these two coefficients we define the Model Specificity ν_j of each latent j, which compares its base-model coefficient with its chat-model coefficient. When ν_j ≈ 0, the latent is specific to the chat model; when ν_j ≈ 1, it is shared between the base and chat models. Separating these two cases is what makes the model diffing work.
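The interpretation of ν_j can be sketched in a few lines. The code assumes ν_j is the ratio of the base-model coefficient to the chat-model coefficient, which is consistent with ν_j ≈ 0 meaning chat-specific and ν_j ≈ 1 meaning shared. The function names and the tolerance `tol=0.2` are illustrative choices, not values taken from the notes.

```python
import numpy as np

def model_specificity(beta_base, beta_chat):
    """nu_j = beta_base_j / beta_chat_j (assumed definition); NaN where beta_chat_j == 0."""
    beta_base = np.asarray(beta_base, dtype=float)
    beta_chat = np.asarray(beta_chat, dtype=float)
    nu = np.full_like(beta_base, np.nan)
    np.divide(beta_base, beta_chat, out=nu, where=beta_chat != 0)
    return nu

def label_latents(nu, tol=0.2):
    """Label each latent by how close nu is to 0 or 1 (tol is a hypothetical threshold)."""
    nu = np.asarray(nu, dtype=float)
    labels = np.full(nu.shape, "unclear", dtype=object)
    labels[np.abs(nu) < tol] = "chat-specific"
    labels[np.abs(nu - 1.0) < tol] = "shared"
    return labels

# Three example latents with made-up coefficients.
nu = model_specificity([0.01, 0.95, 0.5], [1.0, 1.0, 1.0])
labels = label_latents(nu)
```

Latents that fall between the two regimes are left as "unclear" here; how to treat them is a design choice that the notes do not settle.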

When a single crosscoder with a common dictionary is trained on the chat and base models simultaneously, the previous approach looked only at the representation norm: a latent was labeled chat-specific if its base-side representation was zero. This caused the crosscoder to report too many "false chat-specific" latents, because L1 regularization can drive the base-side norm to zero even when the latent is still used substantially in the base model. Taking the vector direction into account and using the reconstruction coefficient to correct for scale separates chat-only features more cleanly.
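A toy example makes the failure mode concrete. Everything below is synthetic and hypothetical: one latent, a unit-norm chat-side direction, and a base-side vector that has been set to zero to imitate the effect of L1 regularization, even though the base-model targets still contain the direction at 0.9 times the chat scale. The ratio ν is again assumed to be the base coefficient divided by the chat coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 2000, 16

# One latent: a unit-norm direction and sparse, non-negative coefficients.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
acts = np.maximum(rng.normal(size=n_tokens), 0.0)

# Both models use the direction; the base model at 0.9x the chat scale.
noise = lambda: 0.01 * rng.normal(size=(n_tokens, d_model))
chat_targets = 1.0 * acts[:, None] * direction + noise()
base_targets = 0.9 * acts[:, None] * direction + noise()

# Norm-based criterion: the base-side vector was shrunk to zero,
# so the latent looks chat-specific.
base_vector = np.zeros(d_model)
norm_ratio = np.linalg.norm(base_vector) / np.linalg.norm(direction)

def scale(targets):
    """Least-squares scalar for targets ~ beta * acts * direction."""
    return np.sum(acts * (targets @ direction)) / (
        np.sum(acts ** 2) * np.dot(direction, direction)
    )

# Direction-aware criterion: fit the same direction against each model.
nu = scale(base_targets) / scale(chat_targets)

print(f"norm ratio = {norm_ratio:.2f}, nu = {nu:.2f}")
```

In this setup the norm ratio is exactly zero, while ν stays close to 0.9, so the norm-only test would flag the latent as chat-specific and the scale-corrected test would treat it as shared.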