SAE Feature Absorption

Creator: Seonglae Cho
Created: 2024 Oct 24 11:18
Edited: 2025 Dec 24 0:43

Feature Absorption reduces interpretability

Exploring SAE hierarchy is very important and valuable
When an SAE learns two separate features describing the same ground-truth feature, representations of that feature are split between the two learned features randomly.
A latent may appear to track a specific interpretable feature, but in practice it leaves gaps: it fails to fire on some inputs where the feature is present, and other, seemingly unrelated latents absorb the feature on those inputs.
This appears to be a side effect of making the decomposition too sparse.
For example, the interpretable feature 'starts with L' is not activated under certain conditions, and instead, latent variables related to specific tokens like 'lion' absorb that direction.
Sharing (tying) weights between the SAE encoder and decoder has been found to reduce feature absorption.
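To make "sharing weights" concrete, here is a minimal sketch of a tied SAE forward pass in plain Python. The function name `tied_sae_forward`, the ReLU activation, and the convention of subtracting the decoder bias before encoding are assumptions chosen for illustration; the only point taken from the note is that the encoder and decoder use the same weight matrix.

```python
def tied_sae_forward(x, W, b_enc, b_dec):
    """Forward pass of a tied SAE (illustrative sketch, not a library API).

    x:     input vector, length d
    W:     list of latent directions, shape (n_latents, d)
    b_enc: encoder bias, length n_latents
    b_dec: decoder bias, length d

    The same rows of W are used to encode (dot product with the input)
    and to decode (weighted sum of directions), which is what "tied" means.
    """
    # Assumed convention: center the input with the decoder bias first.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]

    # Encoder: pre-activation is <w_j, centered> + b_enc[j], followed by ReLU.
    acts = [
        max(0.0, sum(wi * ci for wi, ci in zip(w, centered)) + b)
        for w, b in zip(W, b_enc)
    ]

    # Decoder: reuse the same directions W (no separate decoder matrix).
    x_hat = [
        b_dec[i] + sum(a * w[i] for a, w in zip(acts, W))
        for i in range(len(x))
    ]
    return acts, x_hat


# Tiny example with two orthonormal latent directions.
acts, x_hat = tied_sae_forward(
    x=[2.0, -1.0],
    W=[[1.0, 0.0], [0.0, 1.0]],
    b_enc=[0.0, 0.0],
    b_dec=[0.0, 0.0],
)
print(acts, x_hat)  # [2.0, 0.0] [2.0, 0.0]
```

Because the encoder direction of each latent is forced to equal its decoder direction, the encoder cannot learn a separate "turn off when a child is present" direction, which is one plausible intuition for why tying helps.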

Cos sim

When SAEs are scaled up (with more latents), "feature splitting" occurs (e.g., "math" → "algebra/geometry"), but this isn't always a good decomposition. While there appear to be monosemantic latents like "starts with S," in practice they suddenly fail to activate in certain cases (false negatives), and instead more specific child/token-aligned latents absorb that directional component and explain the model's behavior.
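One rough way to look for such false negatives is to group a latent's activations by token and list the tokens where the latent should have fired but did not. The sketch below assumes you already have `(token, activation)` pairs for a candidate "starts with S" latent; the record format, the function name `false_negative_tokens`, and the threshold are hypothetical and not part of any SAE library.

```python
from collections import defaultdict


def false_negative_tokens(records, first_letter="s", threshold=0.0):
    """Return {token: miss_rate} for tokens that start with `first_letter`
    but on which the candidate latent stayed at or below `threshold`.

    records: iterable of (token, activation) pairs for ONE latent.
    """
    seen = defaultdict(int)    # occurrences of each matching token
    missed = defaultdict(int)  # occurrences where the latent did not fire

    for token, activation in records:
        if token.strip().lower().startswith(first_letter):
            seen[token] += 1
            if activation <= threshold:
                missed[token] += 1

    # Keep only tokens with at least one miss; these are absorption suspects.
    return {tok: missed[tok] / seen[tok] for tok in seen if missed[tok] > 0}


# Made-up activations for a "starts with S" latent.
records = [
    ("snake", 0.0), ("snake", 0.0),  # never fires: likely absorbed
    ("sun", 1.2), ("sun", 0.0),      # fires only sometimes
    ("sand", 0.9),                   # fires as expected
    ("lion", 0.0),                   # correctly silent, ignored
]
print(false_negative_tokens(records))  # {'snake': 1.0, 'sun': 0.5}
```

Tokens with a high miss rate are candidates for having a token-aligned latent that took over the "starts with S" direction; confirming that would still require inspecting which latents fire on those tokens.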
SAEs recover features that fire independently well. Once hierarchical co-occurrence is introduced (e.g., "feature1 only appears when feature0 is present"), however, absorption occurs: the encoder develops gaps in which the parent latent turns off in certain situations. In general, the sparser and wider the SAE, the stronger the tendency toward absorption.
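The link between sparsity and absorption can be illustrated with a small toy setup. The sketch below samples data where feature1 only appears when feature0 is present, then compares two hand-written (not trained) dictionaries: a "faithful" one with one latent per feature, and an "absorbed" one where the child latent's decoder direction includes the parent direction and the parent latent switches off whenever the child is present. All names, the 2-d orthonormal directions, and the probabilities are assumptions for illustration.

```python
import random

# Orthonormal ground-truth directions in a 2-d toy space (an assumption).
F0 = (1.0, 0.0)  # parent feature ("feature0")
F1 = (0.0, 1.0)  # child feature ("feature1"), only present when F0 is


def sample_hierarchical(n, p_parent=0.5, p_child=0.5, seed=0):
    """Sample (has_f0, has_f1) pairs where f1 can only fire when f0 fires."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        has_f0 = rng.random() < p_parent
        has_f1 = has_f0 and rng.random() < p_child
        samples.append((has_f0, has_f1))
    return samples


def to_input(has_f0, has_f1):
    """Build the input vector as the sum of the active feature directions."""
    return tuple(float(has_f0) * a + float(has_f1) * b for a, b in zip(F0, F1))


# Faithful dictionary: latent 0 decodes to F0, latent 1 decodes to F1.
FAITHFUL_DECODER = (F0, F1)
# Absorbed dictionary: latent 1 decodes to F0 + F1 (it absorbed the parent).
ABSORBED_DECODER = (F0, tuple(a + b for a, b in zip(F0, F1)))


def faithful_encode(has_f0, has_f1):
    return (float(has_f0), float(has_f1))


def absorbed_encode(has_f0, has_f1):
    # The parent latent turns off whenever the child is present.
    return (float(has_f0 and not has_f1), float(has_f1))


def evaluate(samples, encode, decoder):
    """Return (mean squared error, mean L0, parent false-negative rate)."""
    sq_err, l0, parent_on, parent_missed = 0.0, 0, 0, 0
    for has_f0, has_f1 in samples:
        x = to_input(has_f0, has_f1)
        code = encode(has_f0, has_f1)
        x_hat = tuple(
            sum(a * d[i] for a, d in zip(code, decoder)) for i in range(len(x))
        )
        sq_err += sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))
        l0 += sum(1 for a in code if a > 0)
        if has_f0:
            parent_on += 1
            if code[0] == 0:
                parent_missed += 1
    n = len(samples)
    fn_rate = parent_missed / parent_on if parent_on else 0.0
    return sq_err / n, l0 / n, fn_rate


samples = sample_hierarchical(1000)
print("faithful:", evaluate(samples, faithful_encode, FAITHFUL_DECODER))
print("absorbed:", evaluate(samples, absorbed_encode, ABSORBED_DECODER))
```

Both dictionaries reconstruct the toy inputs exactly, but the absorbed one uses fewer active latents per input, so a sparsity penalty prefers it even though its parent latent now has false negatives. This is a hand-built illustration of the incentive, not a claim about what any particular trained SAE converges to.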
Clustering
HDBSCAN
The SAE Feature Absorption and co-occurrence problems cause the model to learn "broken latents". While tied SAEs have cleaner representations due to identical encoder and decoder weights, issues still arise when there are insufficient latents for concepts like parent-child relationships.
To mitigate this mixing phenomenon, an auxiliary loss (the squared cosine similarity between inputs and feature directions, applied in low-activation states) is introduced to encourage a single peak in activation strength.
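The note does not give the exact form of this auxiliary loss, so the following is only one possible reading: for every latent that fires weakly (above zero but below a threshold), add the squared cosine similarity between the input and that latent's direction. The function name `low_activation_cos2_penalty`, the threshold value, and the choice to skip fully inactive latents are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def low_activation_cos2_penalty(x, directions, activations, threshold=0.5):
    """Sum of cos^2(x, d_j) over latents j with 0 < activation_j < threshold.

    Assumed interpretation: weak, partial firings are penalized in
    proportion to how aligned the input is with the latent's direction,
    nudging each latent toward either firing clearly or not at all.
    """
    penalty = 0.0
    for d, a in zip(directions, activations):
        if 0.0 < a < threshold:
            penalty += cosine(x, d) ** 2
    return penalty


# Latent 0 fires weakly on an input at 45 degrees to its direction,
# latent 1 fires strongly and is therefore not penalized.
penalty = low_activation_cos2_penalty(
    x=[1.0, 1.0],
    directions=[[1.0, 0.0], [0.0, 1.0]],
    activations=[0.1, 2.0],
)
print(round(penalty, 6))  # 0.5
```

In training, a term like this would be added to the usual reconstruction and sparsity losses with its own coefficient; how it is weighted and thresholded in the original work is not stated in this note.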
