SAE Feature Distribution

Creator

Creator

Seonglae Cho

Created

Created

2025 Jan 28 16:57

Editor

Editor

Seonglae Cho

Edited

Edited

2025 Feb 27 14:40

Refs

Refs

SAE Feature Density often refers non-zero ratio

feature density of each feature is the fraction of tokens on which the feature has a nonzero value.

almost all of the features in the high density cluster are interpretable, but almost none of the features in the ultralow density cluster are.

notion image

https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream

notion image

Problems about density

High frequency features which are common in Top-k, JumpReLU SAEs activates on 10% of input tokens. Their dynamics and semantics are unknown until now.

feature density histogram

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.

https://transformer-circuits.pub/2023/monosemantic-features#appendix-feature-density

Layer wise visualized analysis

SAE's reconstruction performance degrades sharply when exceeding the training context length

In short contexts, performance worsens in later layers, while in long contexts, early layers show degraded performance

While most SAE feature steering negatively impacts model performance, some features lead to improvements

Errors in early layer SAEs negatively affect the performance of later layers

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders — LessWrong

Note: The second figure in this post originally contained a bug pointed out by @LawrenceC, which has since been fixed. …

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders — LessWrong

https://www.lesswrong.com/posts/8QRH8wKcnKGhpAu2o/examining-language-model-performance-with-reconstructed

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders — LessWrong

Backlinks

Recommendations

/////////////