SAE Feature Distribution

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Jan 28 16:57
Editor
Edited
Edited
2025 Feb 27 14:40
Refs
Refs

SAE Feature Density often refers non-zero ratio

feature density of each feature is the fraction of tokens on which the feature has a nonzero value.

almost all of the features in the high density cluster are interpretable, but almost none of the features in the ultralow density cluster are.
notion image
 
https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream
 
notion image
 

Problems about density

  • High frequency features which are common in Top-k, JumpReLU SAEs activates on 10% of input tokens. Their dynamics and semantics are unknown until now.
 
 
 

feature density histogram

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.

Layer wise visualized analysis

  • SAE's reconstruction performance degrades sharply when exceeding the training context length
  • In short contexts, performance worsens in later layers, while in long contexts, early layers show degraded performance
  • While most SAE feature steering negatively impacts model performance, some features lead to improvements
  • Errors in early layer SAEs negatively affect the performance of later layers
Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders — LessWrong
Note: The second figure in this post originally contained a bug pointed out by @LawrenceC, which has since been fixed. …
Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders  — LessWrong
 
 
 
 

Backlinks

SAE Loss

Recommendations