SAE Feature absorption

Creator: Seonglae Cho
Created: 2024 Oct 24 11:18
Edited: 2024 Oct 24 23:44

Feature Absorption reduces interpretability

When an SAE learns two separate latents that both track the same ground-truth feature, activations of that feature are split between the two learned latents unpredictably.
Although the SAE appears to track a specific interpretable feature, it actually leaves gaps in its predictions, and other, seemingly unrelated latents absorb that feature.
This appears to be a failure mode caused by decomposing too sparsely.
For example, the interpretable 'starts with L' latent fails to activate in certain contexts, and token-specific latents such as 'lion' absorb that direction instead.
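The 'lion' example can be sketched with toy vectors (all numbers and directions below are hypothetical, chosen only to illustrate the mechanism): the token latent's decoder direction has absorbed the 'starts with L' direction, so under a sparse, winner-take-all encoding the interpretable latent stays silent.

```python
import numpy as np

# Hypothetical toy setup: g = "starts with L" direction, t = "lion" token direction.
rng = np.random.default_rng(0)
g = rng.normal(size=16); g /= np.linalg.norm(g)
t = rng.normal(size=16); t /= np.linalg.norm(t)

# The activation for the token "lion" contains both ground-truth features.
x_lion = g + t

# Absorption: the "lion" latent's decoder direction includes the
# "starts with L" component, while a dedicated latent points only along g.
d_lion = (g + t) / np.linalg.norm(g + t)  # absorbed token latent
d_L = g                                    # "starts with L" latent

# A maximally sparse encoder picks the single best-matching latent,
# so the token latent fires and the interpretable latent does not.
scores = {"lion": x_lion @ d_lion, "starts_with_L": x_lion @ d_L}
winner = max(scores, key=scores.get)
print(winner)  # → lion
```

Because `d_lion` aligns with the full activation while `d_L` matches only part of it, the token latent always wins the sparse competition, which is exactly the prediction gap described above.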
Sharing weights between the SAE encoder and decoder was found to reduce feature absorption.
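A minimal sketch of what weight tying means here, assuming a standard ReLU SAE with the encoder constrained to be the transpose of the decoder (class name, dimensions, and the omission of bias terms are illustrative assumptions, not the original paper's exact architecture):

```python
import numpy as np

class TiedSAE:
    """Sketch of an SAE with tied encoder/decoder weights: each latent
    reads and writes along the same unit-norm direction, so a latent
    cannot decode a direction it does not also detect."""

    def __init__(self, d_model: int, d_sae: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(d_model, d_sae))
        # Unit-norm decoder columns; the encoder is W_dec transposed.
        self.W_dec = W / np.linalg.norm(W, axis=0)

    def encode(self, x: np.ndarray) -> np.ndarray:
        return np.maximum(x @ self.W_dec, 0.0)  # tied: encoder = decoder.T

    def decode(self, f: np.ndarray) -> np.ndarray:
        return f @ self.W_dec.T

sae = TiedSAE(d_model=16, d_sae=64)
x = np.random.default_rng(1).normal(size=16)
recon = sae.decode(sae.encode(x))
print(recon.shape)  # → (16,)
```

The intuition for why tying helps: an untied encoder can learn to suppress the 'starts with L' latent on 'lion' while the decoder still reconstructs that direction elsewhere; tying removes that degree of freedom.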