SAE Feature Density often refers to the non-zero ratio
The feature density of a feature is the fraction of tokens on which that feature has a nonzero value.
Almost all of the features in the high-density cluster are interpretable, but almost none of the features in the ultralow-density cluster are.
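As a minimal sketch of the definition above, the density of each feature can be computed by counting nonzero activations per feature and dividing by the number of tokens. The function name `feature_density` and the toy activation values are assumptions made for illustration. Plain Python lists are used only to keep the example self-contained; a real pipeline would most likely run the same count on array or tensor activations.

```python
def feature_density(activations):
    """Return, for each feature, the fraction of tokens with a nonzero activation.

    `activations` is a list of rows, one row per token, each row holding
    one activation value per SAE feature (hypothetical layout).
    """
    if not activations:
        raise ValueError("need at least one token")
    n_tokens = len(activations)
    n_features = len(activations[0])
    counts = [0] * n_features
    for row in activations:
        for j, value in enumerate(row):
            if value != 0:
                counts[j] += 1
    return [c / n_tokens for c in counts]


# Toy example: 4 tokens, 3 features (values are made up).
toy_acts = [
    [0.0, 1.2, 0.0],
    [0.0, 0.3, 0.0],
    [0.5, 0.9, 0.0],
    [0.0, 2.1, 0.0],
]
densities = feature_density(toy_acts)
print(densities)  # [0.25, 1.0, 0.0]
```

In this toy case the second feature fires on every token (density 1.0) and the third never fires (density 0.0, a dead feature).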
Problems about density
- High-frequency features, which are common in Top-k and JumpReLU SAEs, activate on 10% of input tokens. Their dynamics and semantics have so far remained unknown.
feature density histogram
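A feature density histogram is usually drawn over log10(density), since densities span many orders of magnitude. The sketch below bins densities that way and flags high-frequency features. The helper names, the bin edges, the separate count for dead features, and the use of `>=` with a 0.1 threshold (mirroring the 10% figure above) are all assumptions for illustration, not a prescribed method.

```python
import math


def log_density_histogram(densities, bin_edges):
    """Count features per log10(density) bin.

    Features with density 0 have no logarithm, so they are counted
    separately as dead features. Values outside the edges are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    dead = 0
    last = len(counts) - 1
    for d in densities:
        if d <= 0:
            dead += 1
            continue
        x = math.log10(d)
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            # The last bin is closed on the right so density 1.0 is included.
            if lo <= x < hi or (i == last and x == hi):
                counts[i] += 1
                break
    return counts, dead


def high_frequency_features(densities, threshold=0.1):
    """Return indices of features whose density is at least `threshold`."""
    return [i for i, d in enumerate(densities) if d >= threshold]


# Made-up densities for seven features.
toy_densities = [0.5, 0.2, 0.003, 0.002, 0.0000005, 0.0, 1.0]
edges = [-8, -6, -4, -2, 0]  # log10(density) bin edges

counts, dead = log_density_histogram(toy_densities, edges)
print(counts, dead)                            # [1, 0, 2, 3] 1
print(high_frequency_features(toy_densities))  # [0, 1, 6]
```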
Layer-wise visualized analysis
- SAE reconstruction performance degrades sharply when the input exceeds the training context length
- In short contexts, performance worsens in later layers, while in long contexts, early layers show degraded performance
- While most SAE feature steering negatively impacts model performance, some features lead to improvements
- Errors in early layer SAEs negatively affect the performance of later layers
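One way to check the context-length observation above is to measure reconstruction error per token position and compare positions inside and beyond the training context length. The sketch below uses normalized MSE as the error measure. That metric, the function names, and the synthetic vectors are assumptions for illustration; the source does not specify how reconstruction performance was measured.

```python
def normalized_mse_by_position(originals, reconstructions):
    """Per-position normalized MSE: ||x - x_hat||^2 / ||x||^2."""
    errors = []
    for x, x_hat in zip(originals, reconstructions):
        num = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        den = sum(a * a for a in x)
        errors.append(num / den if den > 0 else float("nan"))
    return errors


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def mean_error_split(errors, train_ctx_len):
    """Mean error for positions inside vs. beyond the training context length."""
    return _mean(errors[:train_ctx_len]), _mean(errors[train_ctx_len:])


# Synthetic activations for 4 token positions (not real model data).
originals = [[1.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 1.0]]
reconstructions = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0], [0.0, 0.0]]

errors = normalized_mse_by_position(originals, reconstructions)
print(errors)                       # [0.0, 0.0, 0.25, 1.0]
print(mean_error_split(errors, 2))  # (0.0, 0.625)
```

Running the same comparison separately for each layer's SAE would give the layer-wise view described in the list above.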