SAE Transferability
SAEs (usually) Transfer Between Base and Chat Models
ckkissane • Updated 2025 Feb 2 4:8
SAE analysis with fine-tuning of a multimodal model (2025)
Using the Jigsaw Toxic Comment dataset, token-wise toxicity signals were extracted from the value vectors in GPT2-medium's MLP blocks, and the toxicity representation space was decomposed with SVD. Analysis after DPO training found that the model had learned offsets that bypass the regions where the toxic vectors activate, rather than removing those vectors. As a result, toxic outputs can still be reproduced with minor adjustments.
BatchTopK crosscoder that prevents Complete Shrinkage and Latent Decoupling for chat models
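For reference, a minimal sketch of the BatchTopK activation itself, in plain Python. It assumes the usual formulation in which the sparsity budget is shared across the batch (the top k × batch_size pre-activations are kept, all others are zeroed), so individual samples may use more or fewer than k latents. The function name `batch_topk` and the choice to drop non-positive values are assumptions for illustration, not details taken from the source.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest pre-activations across the whole batch.

    pre_acts: list of rows (one row of latent pre-activations per sample).
    k: average number of active latents allowed per sample.
    Assumption: non-positive values are never kept, mirroring a ReLU.
    """
    n = len(pre_acts)
    # Flatten to (value, row, col) so the ranking is over the whole batch.
    flat = [(v, i, j) for i, row in enumerate(pre_acts) for j, v in enumerate(row)]
    budget = min(k * n, len(flat))
    keep = sorted(flat, key=lambda t: t[0], reverse=True)[:budget]

    out = [[0.0] * len(row) for row in pre_acts]
    for v, i, j in keep:
        if v > 0:
            out[i][j] = v
    return out


# Two samples, k = 1: the batch budget is 2 latents in total,
# and here both go to the first sample.
example = batch_topk([[5.0, 4.0, 3.0], [0.1, 0.2, 0.0]], k=1)
```

In this example the first sample keeps two latents and the second keeps none, which is the per-sample flexibility that distinguishes BatchTopK from a per-sample TopK.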
Fraction of variance unexplained when using SAEs trained on Gemma 2 2B PT to reconstruct the activations generated with Gemma 2 2B IT on user prompts.
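A minimal sketch of how fraction of variance unexplained (FVU) can be computed, assuming the common definition: summed squared reconstruction error divided by the summed squared deviation of the activations from their mean. The source does not state which exact variant was used, and the function name and toy data below are hypothetical.

```python
def fraction_of_variance_unexplained(acts, recons):
    """FVU = sum ||x - x_hat||^2 / sum ||x - mean(x)||^2.

    acts: list of activation vectors (e.g. from the chat model).
    recons: list of SAE reconstructions, same shape as acts.
    0.0 means perfect reconstruction; 1.0 matches predicting the mean.
    """
    n = len(acts)
    d = len(acts[0])
    # Per-dimension mean of the original activations.
    mean = [sum(row[j] for row in acts) / n for j in range(d)]

    residual = sum((x - y) ** 2 for a, r in zip(acts, recons) for x, y in zip(a, r))
    total = sum((x - m) ** 2 for a in acts for x, m in zip(a, mean))
    if total == 0:
        raise ValueError("activations have zero variance; FVU is undefined")
    return residual / total


acts = [[1.0, 0.0], [3.0, 0.0]]
fvu_perfect = fraction_of_variance_unexplained(acts, acts)
fvu_mean = fraction_of_variance_unexplained(acts, [[2.0, 0.0], [2.0, 0.0]])
```

For the transfer setting described above, `acts` would be the Gemma 2 2B IT activations and `recons` the outputs of the SAE trained on Gemma 2 2B PT; a low FVU indicates that the base-model SAE still reconstructs the chat model's activations well.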