E2E SAE

Creator: Seonglae Cho
Created: 2025 Jan 21 14:58
Edited: 2025 Mar 7 17:5
Refs
Traditional Sparse Autoencoders (SAEs) are trained with an MSE loss on reconstructed activation values, which may not adequately capture the functionally important features of the network. E2E SAEs instead train by minimizing the KL divergence between the output distribution of the SAE-inserted model and that of the original model. However, this objective still emphasizes features that reproduce the final LLM distribution, and there is no direct evidence that it improves the capture of important features.
A drawback of E2E SAEs is that the learned feature set is not robust to the random seed, so different seeds yield noticeably different features. SAE_e2e+ds addresses this by adding a loss term that minimizes the reconstruction error at subsequent layers, keeping the network's internal computation paths closer to the original.
SAEs may also learn more about the structure of the dataset than about the computational structure of the network, which further motivates minimizing the KL divergence between the output distributions of the original model and the model with the SAE inserted.
L = L_{KL} + L_{sparsity} + L_{downstream}
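As a rough illustration, here is a minimal PyTorch sketch of the KL and sparsity terms (the downstream term is covered in the next section). The function name `e2e_sae_loss`, the tensor names, and the coefficient are hypothetical placeholders, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def e2e_sae_loss(orig_logits, sae_logits, sae_acts, sparsity_coeff=1.0):
    """KL + sparsity terms of the E2E SAE objective (hypothetical helper).

    orig_logits: logits of the unmodified model                 [batch, seq, vocab]
    sae_logits:  logits after replacing the hook-point
                 activations with the SAE reconstruction        [batch, seq, vocab]
    sae_acts:    SAE feature activations                        [batch, seq, n_features]
    """
    # KL(original || SAE-inserted) over the output distributions
    log_p_orig = F.log_softmax(orig_logits, dim=-1)
    log_p_sae = F.log_softmax(sae_logits, dim=-1)
    kl = F.kl_div(log_p_sae, log_p_orig, log_target=True, reduction="batchmean")

    # L1 sparsity penalty on the SAE feature activations
    sparsity = sae_acts.abs().sum(dim=-1).mean()

    return kl + sparsity_coeff * sparsity
```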

Downstream reconstruction

The downstream reconstruction loss computes the MSE between the activations of the SAE-inserted model and the original LLM across all downstream layers.
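A minimal sketch of this term, assuming the residual-stream activations at each downstream layer have been cached for both the original and the SAE-inserted forward pass; the helper names and the `downstream_coeff` weight are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def downstream_loss(orig_layer_acts, sae_layer_acts):
    """Average MSE between original and SAE-inserted activations over all
    downstream layers. Both arguments are lists of [batch, seq, d_model] tensors."""
    losses = [
        F.mse_loss(sae_act, orig_act)
        for orig_act, sae_act in zip(orig_layer_acts, sae_layer_acts)
    ]
    return torch.stack(losses).mean()

# Full SAE_e2e+ds objective: L = L_KL + L_sparsity + L_downstream
# total_loss = e2e_sae_loss(orig_logits, sae_logits, sae_acts) \
#            + downstream_coeff * downstream_loss(orig_layer_acts, sae_layer_acts)
```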

Recommendations