High Activation Density can mean either that sparsity was not properly learned or that the feature is genuinely needed in many contexts. In the Feature Browser, SAE features appear more interpretable in their higher activation quantiles, which points to a limitation: SAE features tend to have low interpretability at low activations, and their Activation Distributions are noticeably skewed.
However, the features with the highest Activation Density in the Activation Distribution are often less interpretable, mainly because their activations are rarely high in absolute terms (as opposed to quantile). A well-separated, highly interpretable SAE feature should not have a density that simply decays with activation value; rather, after an initial decrease, its distribution should show a second cluster at high activation levels.
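A minimal sketch of how one might check this, assuming you already have a 1-D tensor of one SAE feature's activations over a large token sample (the function name and the crude bimodality heuristic below are illustrative, not from the referenced papers):

```python
# Minimal sketch (not from the referenced papers): inspect one SAE feature's
# activation density and the shape of its nonzero-activation histogram.
import torch

def activation_profile(acts: torch.Tensor, bins: int = 50):
    """acts: one feature's activations over many tokens (zeros included)."""
    density = (acts > 0).float().mean().item()   # fraction of tokens the feature fires on
    nonzero = acts[acts > 0]
    if nonzero.numel() == 0:
        return density, 0.0
    hist = torch.histc(nonzero, bins=bins, min=0.0, max=nonzero.max().item())
    # Crude bimodality signal: mass in the top activation bins relative to the
    # mass just above zero. A near-zero ratio looks like pure decay; a larger
    # ratio indicates a second cluster at high activation values.
    low_mass = hist[: bins // 5].sum()
    high_mass = hist[-(bins // 5):].sum()
    return density, (high_mass / (low_mass + 1e-8)).item()
```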
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
https://transformer-circuits.pub/2023/monosemantic-features#global-analysis-interp-caveats
Dense SAE Latents Are Features, Not Bugs
The claim is that the residual stream contains directions that change next-token semantics (which word will appear) as well as directions that barely change semantics and instead modulate confidence/entropy (the sharpness of the output distribution). This paper shows that the latter, confidence control, is predominantly captured by dense SAE latents.
Dense latents capture a dense subspace that intrinsically exists in the residual stream: when the SAE is retrained on the residual stream with that dense subspace removed, almost no dense latents re-emerge → they are not a training artifact. Dense latents also appear as antipodal pairs (± pairs of directions) that jointly represent a single direction.
Role classification: position tracking, context binding, entropy regulation (via the nullspace/kernel), alphabet/output signals, part-of-speech and semantic words, and PCA reconstruction. The nullspace was previously thought of as meaningless garbage dimensions, but this result suggests it serves as a control channel the model actively uses.
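A minimal sketch of how dense latents and candidate antipodal pairs could be flagged, assuming a generic SAE layout (an activation matrix and a decoder weight matrix); this is an illustration, not the paper's code:

```python
# Minimal sketch (assumed SAE layout, not the paper's code): flag "dense"
# latents by firing frequency and find candidate antipodal pairs, i.e. dense
# decoder directions whose cosine similarity is close to -1.
import torch
import torch.nn.functional as F

def dense_and_antipodal(acts: torch.Tensor, W_dec: torch.Tensor,
                        density_thresh: float = 0.1, cos_thresh: float = -0.95):
    """acts: [n_tokens, n_latents] SAE activations; W_dec: [n_latents, d_model]."""
    density = (acts > 0).float().mean(dim=0)              # per-latent firing frequency
    dense_idx = torch.nonzero(density > density_thresh).squeeze(-1)

    dirs = F.normalize(W_dec[dense_idx], dim=-1)          # unit decoder directions
    cos = dirs @ dirs.T
    # Zero out the lower triangle and diagonal so each pair is reported once.
    mask = torch.triu(cos, diagonal=1) < cos_thresh
    i, j = torch.nonzero(mask, as_tuple=True)
    pairs = list(zip(dense_idx[i].tolist(), dense_idx[j].tolist()))
    return dense_idx.tolist(), pairs                      # dense latent ids, candidate ± pairs
```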
One potential cause comes from the Adam Optimizer.
Privileged Bases in the Transformer Residual Stream
Our mathematical theories of the Transformer architecture suggest that individual coordinates in the residual stream should have no special significance (that is, the basis directions should be in some sense "arbitrary" and no more likely to encode information than random directions). Recent work has shown that this observation is false in practice. We investigate this phenomenon and provisionally conclude that the per-dimension normalizers in the Adam optimizer are to blame for the effect.
https://transformer-circuits.pub/2023/privileged-basis/index.html
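A toy check of the Adam claim, assuming a single simplified Adam-style step (v = g², no momentum, no bias correction): the per-coordinate normalizer does not commute with an orthogonal rotation of the parameters, while a plain SGD step does:

```python
# Toy check (simplified single step: v = g**2, no momentum, no bias correction):
# Adam's per-coordinate normalizer g / (sqrt(v) + eps) is basis-dependent,
# while a plain SGD step commutes with an orthogonal rotation of the parameters.
import torch

torch.manual_seed(0)
d = 8
g = torch.randn(d)                                   # a gradient vector
R, _ = torch.linalg.qr(torch.randn(d, d))            # a random orthogonal rotation

sgd_step = lambda grad: -grad
adam_step = lambda grad: -grad / (grad.pow(2).sqrt() + 1e-8)   # roughly -sign(grad)

print(torch.allclose(R @ sgd_step(g), sgd_step(R @ g)))    # expected: True (rotation-equivariant)
print(torch.allclose(R @ adam_step(g), adam_step(R @ g)))  # expected: False (privileges the basis)
```

Because the update magnitude is set per coordinate, training with Adam tends to privilege the coordinate basis the optimizer works in, which is the mechanism the paper provisionally blames for the privileged residual-stream basis.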

Seonglae Cho