SAE Feature Stitching

Creator
Seonglae Cho
Created
2025 Mar 7 17:15
Edited
2025 Jun 2 1:23

Exchanging latent features between SAEs of different sizes

Reconstruction latent (SAE Feature Splitting, SAE Feature Absorption)

If adding a latent from the larger SAE to the smaller SAE degrades reconstruction performance or leaves it unchanged, that latent is judged to be a finer-grained version of latents already present in the smaller SAE. Such a latent is called a reconstruction latent.

Novel latent

A novel latent is identified when adding an individual latent from a larger SAE to a smaller SAE improves reconstruction performance (e.g., lowers MSE), indicating that the latent carries new information not present in the smaller SAE.
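The two cases above can be illustrated with a minimal sketch. It assumes a plain ReLU SAE, a toy dataset, and a hypothetical helper `classify_latent` that adds one latent from the larger SAE to the smaller SAE's reconstruction and compares MSE before and after. The parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), the tolerance, and the toy weights are illustrative assumptions, not taken from any particular library or paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec):
    """Reconstruction of a plain ReLU SAE (assumed architecture)."""
    acts = relu(x @ W_enc + b_enc)
    return acts @ W_dec + b_dec

def classify_latent(x, small, large, idx, tol=1e-6):
    """Stitch latent `idx` of the larger SAE onto the smaller SAE.

    Returns ("novel" | "reconstruction", change in MSE).
    Hypothetical helper for illustration only.
    """
    base = sae_reconstruct(x, **small)
    base_mse = np.mean((x - base) ** 2)

    # Activation and decoder direction of the single stitched latent.
    act = relu(x @ large["W_enc"][:, idx] + large["b_enc"][idx])
    stitched = base + np.outer(act, large["W_dec"][idx])
    stitched_mse = np.mean((x - stitched) ** 2)

    delta = stitched_mse - base_mse
    label = "novel" if delta < -tol else "reconstruction"
    return label, delta

# Toy data: inputs live on two directions, e1 and e2.
x = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [3.0, 1.0, 0.0]])

# Smaller SAE: one latent that only captures e1.
small = {
    "W_enc": np.array([[1.0], [0.0], [0.0]]),
    "b_enc": np.zeros(1),
    "W_dec": np.array([[1.0, 0.0, 0.0]]),
    "b_dec": np.zeros(3),
}

# Larger SAE: latent 0 duplicates e1, latent 1 captures e2.
large = {
    "W_enc": np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]),
    "b_enc": np.zeros(2),
    "W_dec": np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    "b_dec": np.zeros(3),
}

label_dup, delta_dup = classify_latent(x, small, large, idx=0)
label_new, delta_new = classify_latent(x, small, large, idx=1)
print(label_dup, delta_dup)
print(label_new, delta_new)
```

In this toy setup the duplicated e1 latent overshoots the reconstruction and raises MSE, so it is labeled a reconstruction latent, while the e2 latent explains previously unexplained input and is labeled novel.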
 
 
 
 
Combining SAE and NMF transforms the model's internal representations into human-understandable units, making the (black-box) diffusion model transparently manipulable. Hundreds of SAE features are grouped by NMF into a few high-level units (factors), so that features produced by SAE Feature Splitting are merged back together. In the equation V = WH, V is the original SAE activation strength matrix, each row of H represents a high-level factor, and the values in that row are the weights of the corresponding SAE features. Each row of W then gives one sample's activation on those factors.
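The factorization V = WH can be sketched on a toy activation matrix. The block below is a minimal illustration that assumes standard multiplicative-update NMF written directly in NumPy. The function name `nmf`, the toy matrices `W_true` and `H_true`, the iteration count, and the number of factors are all assumptions made for the example, not the setup of any specific work.

```python
import numpy as np

def nmf(V, n_factors, n_iter=1000, eps=1e-9, seed=0):
    """Approximate V (samples x SAE features) as W @ H with W, H >= 0.

    Uses multiplicative updates for the squared Frobenius error.
    Illustrative sketch, not a tuned implementation.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = V.shape
    W = rng.random((n_samples, n_factors)) + eps
    H = rng.random((n_factors, n_features)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy SAE activations: features 0 and 1 co-activate, as do features 2 and 3,
# mimicking split features that belong to the same high-level unit.
W_true = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0],
                   [0.0, 3.0], [1.0, 1.0], [2.0, 1.0]])
H_true = np.array([[1.0, 0.5, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 2.0]])
V = W_true @ H_true  # shape (6 samples, 4 SAE features)

W, H = nmf(V, n_factors=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", rel_err)

# For each factor (row of H), list SAE features by descending weight.
for k, row in enumerate(H):
    print("factor", k, "feature ranking:", np.argsort(row)[::-1])
```

Reading each row of H from its largest weights downward shows which SAE features a factor groups together. The order and scale of the factors are arbitrary, so only the grouping is meaningful.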
 
 
