SAE Feature Splitting

Creator: Seonglae Cho
Created: 2025 Jan 30 0:28
Edited: 2026 Jan 3 22:33

Exploring the hierarchy of SAE features, and how they split and merge across dictionary sizes, is important and valuable.


Feature Fragmentation

Split features were identified through masked cosine similarity between decoded neuron activations. The structure of this refinement is more complex than a tree: rather, the features we find at one level may both split and merge to form refined features at the next.
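As a rough illustration of how split features can be matched across dictionary sizes, the sketch below compares decoder directions of a smaller and a larger SAE with plain (unmasked) cosine similarity; the shapes, the threshold, and the random weights are illustrative stand-ins for trained models, not the original masked procedure.

```python
import torch

# Hypothetical decoder matrices from two SAEs trained on the same activations:
# a smaller dictionary and a larger one. Random tensors stand in for trained
# weights; shapes are (n_latents, d_model).
d_model, n_small, n_large = 512, 1024, 8192
W_small = torch.randn(n_small, d_model)
W_large = torch.randn(n_large, d_model)

# Unit-normalize decoder directions so dot products are cosine similarities.
W_small = W_small / W_small.norm(dim=-1, keepdim=True)
W_large = W_large / W_large.norm(dim=-1, keepdim=True)

# Cosine similarity between every (small, large) feature pair.
sim = W_small @ W_large.T  # (n_small, n_large)

# For each large-SAE feature, find its closest small-SAE "parent" feature.
best_sim, parent = sim.max(dim=0)

# A coarse feature has "split" if several fine features map onto it with
# high similarity; fine features that sit almost equally close to two
# parents are the merge side of the structure.
threshold = 0.7  # arbitrary cutoff, for illustration only
split_counts = torch.bincount(parent[best_sim > threshold], minlength=n_small)
print("small-SAE features with more than one close child:",
      (split_counts > 1).sum().item())
```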

SAE Feature Splitting

There might be some idealized set of features that dictionary learning would return if we provided it with an unlimited dictionary size ("true features"). However, the correct number of features for dictionary learning is less important than it might initially seem. The fact that true features are clustered into sets of similar features suggests that dictionary learning with fewer features can provide a "summary" of model features.
Feature fragmentation
Cosine similarity for embeddings
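To make the "summary" idea above concrete, here is a minimal sketch that clusters a wide SAE's decoder directions with k-means and treats each centroid as one coarse feature of a hypothetical smaller dictionary; the decoder matrix below is random and only stands in for trained weights.

```python
import numpy as np
from sklearn.cluster import KMeans

# Decoder directions of a (hypothetically trained) wide SAE: one unit vector
# per latent. A random matrix stands in for the real weights here.
n_wide, d_model, n_summary = 2048, 128, 64
W_wide = np.random.randn(n_wide, d_model)
W_wide /= np.linalg.norm(W_wide, axis=1, keepdims=True)

# Cluster the unit decoder directions; each centroid plays the role of one
# feature in a smaller dictionary that "summarizes" a set of similar,
# more fine-grained features.
kmeans = KMeans(n_clusters=n_summary, n_init=10, random_state=0).fit(W_wide)
centroids = kmeans.cluster_centers_
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Within-cluster cosine similarity indicates how well one coarse direction
# stands in for the fine-grained features assigned to it.
cos = np.sum(W_wide * centroids[kmeans.labels_], axis=1)
print("mean cosine(fine feature, its summary feature):", cos.mean())
```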

When SAEs are scaled up (with more latents), "feature splitting" occurs (e.g., "math" → "algebra/geometry"), but this isn't always a good decomposition. While there appear to be monosemantic latents like "starts with S," in practice they suddenly fail to activate in certain cases (false negatives), and instead more specific child/token-aligned latents absorb that directional component and explain the model's behavior.
For features that fire independently, SAEs recover them well, but when hierarchical co-occurrence is introduced (e.g., "feature1 only appears when feature0 is present"), absorption occurs where the encoder creates gaps (parent latent turns off in certain situations). Generally, the more sparse and wider the SAE, the greater the tendency for absorption.
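A toy sketch of this hierarchical co-occurrence setup, plus a simple absorption probe: on inputs where the parent concept is present, count how often the parent latent stays silent while the child latent fires instead. The latent indices and encoder weights below are placeholders; in practice they would come from a trained SAE and from latents already identified as tracking these concepts.

```python
import torch

torch.manual_seed(0)

# Hierarchical co-occurrence: the child concept (feature 1) can only be
# active when the parent concept (feature 0) is active.
d_model, n_samples = 64, 10_000
dir_parent = torch.nn.functional.normalize(torch.randn(d_model), dim=0)
dir_child = torch.nn.functional.normalize(torch.randn(d_model), dim=0)

parent_on = torch.rand(n_samples) < 0.3
child_on = parent_on & (torch.rand(n_samples) < 0.5)  # child implies parent
acts = (parent_on.float()[:, None] * dir_parent
        + child_on.float()[:, None] * dir_child
        + 0.01 * torch.randn(n_samples, d_model))

# Absorption probe: fraction of parent-positive inputs on which the parent
# latent is silent (a false negative) while the child latent is active.
def absorption_rate(W_enc, b_enc, parent_idx, child_idx):
    z = torch.relu(acts @ W_enc + b_enc)  # (n_samples, n_latents)
    parent_silent = z[parent_on, parent_idx] == 0
    child_fires = z[parent_on, child_idx] > 0
    return (parent_silent & child_fires).float().mean().item()

n_latents = 256
W_enc = torch.randn(d_model, n_latents) * 0.1  # placeholder for trained weights
b_enc = torch.zeros(n_latents)
print("absorption rate:", absorption_rate(W_enc, b_enc, parent_idx=0, child_idx=1))
```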
SAEs assume concepts are independent and stationary over time, but actual LM activations exhibit strong temporal correlations and non-stationarity. The SAE's temporal-independence and fixed-sparsity assumptions lead to bottlenecks such as SAE Feature Splitting.

Temporal Feature Analysis (TFA) decomposes activations into predictable (slow, contextual) components and novel (fast, residual) components. It outperforms SAEs on garden-path sentence parsing, event boundary detection, and capturing long-range structure. In other words, interpretability tools require an Inductive Bias aligned with the temporal structure of the data.
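One simple way to realize such a predictable/novel split, as a sketch (not necessarily the actual TFA procedure): fit a one-step linear predictor of the activation at position t from the activation at position t-1, treat the prediction as the slow, contextual component and the residual as the novel component. The toy random-walk signal below stands in for real LM activations.

```python
import torch

torch.manual_seed(0)

# Toy non-stationary, temporally correlated signal standing in for one
# sequence of residual-stream activations (seq_len x d_model).
seq_len, d_model = 512, 64
acts = torch.randn(seq_len, d_model).cumsum(dim=0)

# Fit a least-squares linear map A so that acts[t] ≈ acts[t-1] @ A.
X, Y = acts[:-1], acts[1:]
A = torch.linalg.lstsq(X, Y).solution

predictable = X @ A        # slow, contextual component
novel = Y - predictable    # fast, residual (novel) component

# For strongly temporally correlated activations, most of the variance
# should be carried by the predictable component.
print("variance explained by the predictable component:",
      1 - novel.var().item() / Y.var().item())
```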

Recommendations