SAE Feature Distribution

Creator: Seonglae Cho
Created: 2025 Jan 28 16:57
Edited: 2025 Feb 27 14:40

SAE feature density usually refers to the ratio of nonzero activations.

The density of a feature is the fraction of tokens on which that feature has a nonzero value.
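
The definition above can be sketched in a few lines. This is a minimal illustration, not code from any SAE library: the function name `feature_density` and the layout of `activations` (one row per token, one column per feature) are assumptions, and the values are toy numbers.

```python
def feature_density(activations):
    """Fraction of tokens on which each feature is nonzero.

    activations: list of per-token rows, each a list of feature activations
    (assumed layout: n_tokens x n_features).
    """
    n_tokens = len(activations)
    if n_tokens == 0:
        raise ValueError("need at least one token")
    n_features = len(activations[0])
    counts = [0] * n_features
    for row in activations:
        for j, value in enumerate(row):
            if value != 0:
                counts[j] += 1
    return [c / n_tokens for c in counts]


# Toy activations: 4 tokens, 3 features.
acts = [
    [0.0, 1.2, 0.0],
    [0.5, 0.3, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.7, 0.1],
]
density = feature_density(acts)
print(density)  # [0.25, 1.0, 0.25]
```

In practice the same count would normally be accumulated over many batches of tokens rather than a single in-memory list.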

Almost all of the features in the high-density cluster are interpretable, but almost none of the features in the ultralow-density cluster are.
 
https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream
 
 

Problems with density

  • High-frequency features, which are common in Top-k and JumpReLU SAEs, activate on 10% of input tokens. Their dynamics and semantics are not yet understood.
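
As a rough sketch of how such features could be flagged for inspection, the snippet below filters per-feature densities by a cutoff. The 10% figure comes from the note above; treating it as an inclusive threshold, the function name `high_frequency_features`, and the example densities are all assumptions for illustration.

```python
def high_frequency_features(densities, threshold=0.10):
    """Indices of features whose density is at or above the threshold.

    densities: per-feature fraction of tokens with a nonzero activation.
    threshold: assumed cutoff, using the 10% figure as an inclusive bound.
    """
    return [i for i, d in enumerate(densities) if d >= threshold]


# Made-up densities for four features.
example_densities = [0.001, 0.25, 0.10, 0.00002]
flagged = high_frequency_features(example_densities)
print(flagged)  # [1, 2]
```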
 
 
 

feature density histogram
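
A density histogram of this kind is commonly drawn on a log10 scale, because densities span several orders of magnitude. The sketch below bins features by the integer part of log10(density) and counts never-firing features separately; the bin width of one decade, the name `log_density_histogram`, and the sample values are assumptions, not the method behind the figure.

```python
import math
from collections import Counter


def log_density_histogram(densities):
    """Count features per decade of density.

    Returns (bins, dead): bins maps an exponent e to the number of features
    with density in [10**e, 10**(e + 1)); dead counts features with density 0,
    which have no finite log10 value.
    """
    bins = Counter()
    dead = 0
    for d in densities:
        if d <= 0:
            dead += 1
        else:
            bins[math.floor(math.log10(d))] += 1
    return dict(bins), dead


# Made-up densities for five features.
bins, dead = log_density_histogram([0.0, 0.5, 0.03, 0.02, 0.00004])
for exponent in sorted(bins):
    print(f"[1e{exponent}, 1e{exponent + 1}): {'#' * bins[exponent]}")
print(f"dead: {dead}")
```

Separate clusters, such as the high-density and ultralow-density groups mentioned earlier, would show up as distinct peaks in the bin counts.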

Layer-wise visualized analysis

  • An SAE's reconstruction performance degrades sharply when the input exceeds the training context length
  • In short contexts, performance worsens in later layers, while in long contexts, early layers show degraded performance
  • While most SAE feature steering negatively impacts model performance, some features lead to improvements
  • Errors in early layer SAEs negatively affect the performance of later layers
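
One way to check the context-length effect on a given SAE is to compare reconstruction error for token positions inside and beyond the training context length. The sketch below assumes normalized MSE as the metric and uses hypothetical names (`normalized_mse`, `error_by_context`, `train_ctx_len`); the vectors are toy values, not measured results.

```python
def normalized_mse(original, reconstruction):
    """Squared reconstruction error divided by the squared norm of the original."""
    err = sum((o - r) ** 2 for o, r in zip(original, reconstruction))
    norm = sum(o ** 2 for o in original)
    return err / norm


def error_by_context(originals, reconstructions, train_ctx_len):
    """Mean normalized MSE for positions inside vs. beyond train_ctx_len.

    originals / reconstructions: one activation vector per token position.
    """
    per_pos = [normalized_mse(o, r) for o, r in zip(originals, reconstructions)]

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(per_pos[:train_ctx_len]), mean(per_pos[train_ctx_len:])


# Toy example: 3 positions, assumed training context length of 2.
originals = [[1.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
reconstructions = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
inside, beyond = error_by_context(originals, reconstructions, train_ctx_len=2)
print(inside, beyond)  # 0.125 1.0
```

Running the same comparison per layer would give the layer-wise view described in the bullets above.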
 
 
 
 

Recommendations