SAE Feature Visualization

Creator: Seonglae Cho
Created: 2025 Feb 15 21:20
Edited: 2025 Feb 24 21:34

UMAP

Browsing code error feature

Layer-wise visualized analysis

  • SAE reconstruction performance degrades sharply once the input exceeds the SAE's training context length
  • In short contexts, performance is worse in later layers; in long contexts, it is the early layers that degrade
  • Most SAE feature steering hurts model performance, but steering on some features improves it
  • Errors in early-layer SAEs negatively affect the performance of later layers
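One way to check the context-length observation above is to compare reconstruction error for token positions inside and beyond the training context length. The following is a minimal sketch, not the method used in the analysis: the helper names (`normalized_mse`, `error_by_context_bucket`) and the toy numbers are assumptions made up for illustration, and real activations would come from a model and a trained SAE.

```python
def normalized_mse(x, x_hat):
    """Squared reconstruction error divided by the squared norm of the activation."""
    err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    norm = sum(a ** 2 for a in x)
    return err / norm


def error_by_context_bucket(acts, recons, train_ctx_len):
    """Mean normalized MSE for positions inside vs. beyond the training context length."""
    inside, beyond = [], []
    for pos, (x, x_hat) in enumerate(zip(acts, recons)):
        bucket = inside if pos < train_ctx_len else beyond
        bucket.append(normalized_mse(x, x_hat))

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return {"inside": mean(inside), "beyond": mean(beyond)}


# Hypothetical toy data: one activation vector per token position,
# with the SAE reconstruction of that vector alongside it.
toy_acts = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [1.0, 1.0]]
toy_recons = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [1.0, 0.0]]

report = error_by_context_bucket(toy_acts, toy_recons, train_ctx_len=2)
print(report)
```

Running the same comparison separately for each layer's SAE would give the layer-wise view described in the list.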

Cosine similarity

When an SAE is scaled up (more latents), "feature splitting" occurs (e.g., "math" → "algebra" / "geometry"), but the split is not always a good decomposition. Some latents look monosemantic, such as "starts with S," yet in practice they fail to activate on some inputs where the feature is present (false negatives). In those cases, more specific child or token-aligned latents absorb that component of the direction and account for the model's behavior instead.
SAEs recover features well when the features fire independently. Once hierarchical co-occurrence is introduced (e.g., "feature1 only appears when feature0 is present"), absorption occurs: the encoder leaves gaps in which the parent latent turns off in certain situations. In general, the sparser and wider the SAE, the stronger the tendency toward absorption.
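The false negatives described above can be looked for directly once ground-truth labels for the parent feature are available. The sketch below is an assumed diagnostic, not an established API: `absorption_report`, the latent names, and the activation values are all hypothetical and chosen only to show the shape of the check.

```python
def absorption_report(is_parent_true, parent_acts, other_acts, threshold=0.0):
    """Recall of a parent latent, plus which other latents fire where it misses.

    is_parent_true: list of bools, ground truth for the parent feature per token.
    parent_acts:    list of floats, activation of the parent latent per token.
    other_acts:     dict mapping latent name -> list of activations per token.
    """
    positives = [i for i, present in enumerate(is_parent_true) if present]
    if not positives:
        raise ValueError("no tokens where the parent feature is present")

    # False negatives: the feature is present but the parent latent is silent.
    misses = [i for i in positives if parent_acts[i] <= threshold]
    recall = 1 - len(misses) / len(positives)

    # Count which other latents are active on the missed tokens.
    absorbers = {}
    for i in misses:
        for name, acts in other_acts.items():
            if acts[i] > threshold:
                absorbers[name] = absorbers.get(name, 0) + 1
    return recall, misses, absorbers


# Hypothetical toy data for a "starts with S" latent.
tokens = ["snake", "sun", "short", "apple", "sand"]
starts_with_s = [t.startswith("s") for t in tokens]
parent_acts = [1.2, 0.9, 0.0, 0.0, 1.1]  # silent on "short"
other_acts = {
    "token_short": [0.0, 0.0, 2.3, 0.0, 0.0],
    "token_apple": [0.0, 0.0, 0.0, 1.8, 0.0],
}

recall, misses, absorbers = absorption_report(starts_with_s, parent_acts, other_acts)
print(recall, [tokens[i] for i in misses], absorbers)
```

A latent that shows up repeatedly in `absorbers` is a candidate for having absorbed the parent direction on those tokens; confirming it would still require inspecting its decoder direction.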
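A toy calculation can show why a sparsity penalty favors absorption under hierarchical co-occurrence. This is a hand-built sketch under stated assumptions, not a trained SAE: the two dictionaries, the unit-norm decoder rows, and the use of an L1 penalty are all illustrative choices.

```python
import math


def decode(acts, decoder):
    """Reconstruct an input as the activation-weighted sum of decoder rows."""
    dim = len(decoder[0])
    return [sum(a * row[i] for a, row in zip(acts, decoder)) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


S = 1 / math.sqrt(2)

# Input where feature0 (parent) and feature1 (child) are both present.
x_both = [1.0, 1.0]

# "Clean" dictionary: one latent per true feature, both fire.
clean_decoder = [[1.0, 0.0], [0.0, 1.0]]
clean_acts = [1.0, 1.0]

# "Absorbing" dictionary: the child latent's unit-norm direction also
# carries the parent component, so the parent latent can stay off.
absorbing_decoder = [[1.0, 0.0], [S, S]]
absorbing_acts = [0.0, math.sqrt(2)]

clean_recon = decode(clean_acts, clean_decoder)
absorbing_recon = decode(absorbing_acts, absorbing_decoder)

clean_l1 = sum(abs(a) for a in clean_acts)          # 2 active latents
absorbing_l1 = sum(abs(a) for a in absorbing_acts)  # 1 active latent

# Both reconstruct x_both, but the absorbing solution pays a smaller L1 penalty.
print(clean_recon, absorbing_recon)
print(clean_l1, absorbing_l1)

# The child direction is no longer orthogonal to the parent direction.
print(cosine(absorbing_decoder[0], absorbing_decoder[1]))
```

Both dictionaries reconstruct the input equally well, so the sparsity term alone decides between them, and it prefers the one where the parent latent is silent whenever the child is present. For an input with only feature0 present, both dictionaries would use the parent latent alone, which is why the gap appears only in the co-occurring cases.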
 
 
 

 
