SAE Latent Unit
SAE enables feature disentanglement from Superposition
Responds to both abstract discussions and concrete examples
If there are features representing more abstract properties of the input, might there also be more abstract, higher-level actions that trigger behaviors over the span of multiple tokens?
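As a concrete reference for the disentanglement claim above, here is a minimal sketch of a standard ReLU SAE with an L1 sparsity penalty, trained to reconstruct model activations in an overcomplete latent space. The class name, dimensions, and loss coefficient are illustrative assumptions, not taken from any particular implementation.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete dictionary trained to reconstruct
    model activations while keeping latent activations sparse."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        # d_sae >> d_model: more latent units than activation dimensions,
        # so superposed features can each get their own unit.
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode: sparse, non-negative latent activations.
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the activation as a sparse combination of
        # decoder rows (the "feature directions").
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error keeps the dictionary faithful; the L1 penalty
    # pushes most latent units to zero for any given input.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


# Hypothetical shapes: residual-stream activations of width 768,
# expanded into an 8x overcomplete latent space.
sae = SparseAutoencoder(d_model=768, d_sae=8 * 768)
x = torch.randn(32, 768)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
```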
SAE Feature Notion
SAE Feature Steering
SAE Feature Universality
SAE Feature Structure
SAE Dead Neuron
SAE Feature Distribution
SAE Feature Matching
SAE Feature Splitting
SAE Feature Visualization
SAE Feature Shrinkage
SAE Feature Absorption
SAE Pathological Error
SAE Feature Bimodality
SAE Bounce Plot
SAE Feature Importance
SAE Feature Direction
SAE Feature Stitching
SAE Feature Metrics
Striking visualization of the feature activation distribution, with conditional distribution and log-likelihood
High Activation Density could mean either that sparsity was not properly learned, or that the feature is genuinely important and fires in many different situations. In the Feature Browser, SAE features are more interpretable when more of their activation mass falls in the high-activation Quantiles; this points to a limitation: SAE features tend to have low interpretability at low activations, and their Activation Distributions are skewed.
However, the features with the highest Activation Density in the Activation Distribution are less interpretable, mainly because their activation values are typically not high in absolute terms (as opposed to in quantile terms). A well-separated, highly interpretable SAE feature should not show a density that simply decreases with activation value; rather, after an initial decrease, its nonzero activations should cluster at high activation levels.
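A rough sketch of how one might inspect a feature's Activation Density and check whether its nonzero activations cluster at high values rather than simply decaying. The helper `activation_profile` and the synthetic data are hypothetical, for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt


def activation_profile(feature_acts: np.ndarray, n_bins: int = 50):
    """Summarize one SAE feature's activation distribution.

    feature_acts: activations of a single latent unit over a dataset
    (mostly zeros if the feature is sparse).
    """
    nonzero = feature_acts[feature_acts > 0]
    density = nonzero.size / feature_acts.size  # activation density
    # High-activation quantiles: where the interpretable mass should sit.
    q90, q99 = np.quantile(nonzero, [0.9, 0.99]) if nonzero.size else (0.0, 0.0)
    hist, edges = np.histogram(nonzero, bins=n_bins)
    return density, q90, q99, hist, edges


# Synthetic examples: a "good" feature whose nonzero activations cluster
# at high values after an initial dip, vs. a dense, monotonically
# decaying one that is usually harder to interpret.
rng = np.random.default_rng(0)
good = np.concatenate([np.zeros(9000),
                       rng.normal(6.0, 1.0, 800).clip(min=0.1),
                       rng.exponential(0.5, 200)])
dense = np.concatenate([np.zeros(3000), rng.exponential(1.0, 7000)])

for name, acts in [("clustered-high", good), ("decaying-dense", dense)]:
    density, q90, q99, hist, edges = activation_profile(acts)
    plt.stairs(hist, edges, label=f"{name} (density={density:.2f}, q99={q99:.1f})")

plt.xlabel("activation value")
plt.ylabel("count (nonzero activations)")
plt.legend()
plt.savefig("activation_distributions.png")
```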
Automated Interpretability Metrics proposal from EleutherAI