SAE Feature Splitting

Creator: Seonglae Cho
Created: 2025 Jan 30 0:28
Edited: 2026 Jan 3 22:33

Exploring the hierarchy of SAE features, and how they split and merge across dictionary sizes, is important and valuable.


Feature Fragmentation

Split features were identified through masked cosine similarity between decoded neuron activations. The structure of this refinement is more complex than a tree: rather, the features we find at one level may both split and merge to form refined features at the next.
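As a rough illustration of how split features can be matched across dictionary sizes, the sketch below compares decoder directions of a smaller and a larger SAE with plain (unmasked) cosine similarity; the shapes, the threshold, and the random weights are illustrative stand-ins for trained models, not the original masked procedure.

```python
import torch

# Hypothetical decoder matrices from two SAEs trained on the same activations:
# a smaller dictionary and a larger one. Random tensors stand in for trained
# weights; shapes are (n_latents, d_model).
d_model, n_small, n_large = 512, 1024, 8192
W_small = torch.randn(n_small, d_model)
W_large = torch.randn(n_large, d_model)

# Unit-normalize decoder directions so dot products are cosine similarities.
W_small = W_small / W_small.norm(dim=-1, keepdim=True)
W_large = W_large / W_large.norm(dim=-1, keepdim=True)

# Cosine similarity between every (small, large) feature pair.
sim = W_small @ W_large.T  # (n_small, n_large)

# For each large-SAE feature, find its closest small-SAE "parent" feature.
best_sim, parent = sim.max(dim=0)

# A coarse feature has "split" if several fine features map onto it with
# high similarity; fine features that sit almost equally close to two
# parents are the merge side of the structure.
threshold = 0.7  # arbitrary cutoff, for illustration only
split_counts = torch.bincount(parent[best_sim > threshold], minlength=n_small)
print("small-SAE features with more than one close child:",
      (split_counts > 1).sum().item())
```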

SAE Feature Splitting

There might be some idealized set of features that dictionary learning would return if we provided it with an unlimited dictionary size ("true features"). However, the correct number of features for dictionary learning is less important than it might initially seem. The fact that true features are clustered into sets of similar features suggests that dictionary learning with fewer features can provide a "summary" of model features.
Feature fragmentation
Cosine similarity for embeddings
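To make the "summary" idea above concrete, here is a minimal sketch that clusters a wide SAE's decoder directions with k-means and treats each centroid as one coarse feature of a hypothetical smaller dictionary; the decoder matrix below is random and only stands in for trained weights.

```python
import numpy as np
from sklearn.cluster import KMeans

# Decoder directions of a (hypothetically trained) wide SAE: one unit vector
# per latent. A random matrix stands in for the real weights here.
n_wide, d_model, n_summary = 2048, 128, 64
W_wide = np.random.randn(n_wide, d_model)
W_wide /= np.linalg.norm(W_wide, axis=1, keepdims=True)

# Cluster the unit decoder directions; each centroid plays the role of one
# feature in a smaller dictionary that "summarizes" a set of similar,
# more fine-grained features.
kmeans = KMeans(n_clusters=n_summary, n_init=10, random_state=0).fit(W_wide)
centroids = kmeans.cluster_centers_
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Within-cluster cosine similarity indicates how well one coarse direction
# stands in for the fine-grained features assigned to it.
cos = np.sum(W_wide * centroids[kmeans.labels_], axis=1)
print("mean cosine(fine feature, its summary feature):", cos.mean())
```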

When SAEs are scaled up (with more latents), "feature splitting" occurs (e.g., "math" → "algebra/geometry"), but this isn't always a good decomposition. While there appear to be monosemantic latents like "starts with S," in practice they suddenly fail to activate in certain cases (false negatives), and instead more specific child/token-aligned latents absorb that directional component and explain the model's behavior.
For features that fire independently, SAEs recover them well, but when hierarchical co-occurrence is introduced (e.g., "feature1 only appears when feature0 is present"), absorption occurs where the encoder creates gaps (parent latent turns off in certain situations). Generally, the more sparse and wider the SAE, the greater the tendency for absorption.
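A toy sketch of this hierarchical co-occurrence setup, plus a simple absorption probe: on inputs where the parent concept is present, count how often the parent latent stays silent while the child latent fires instead. The latent indices and encoder weights below are placeholders; in practice they would come from a trained SAE and from latents already identified as tracking these concepts.

```python
import torch

torch.manual_seed(0)

# Hierarchical co-occurrence: the child concept (feature 1) can only be
# active when the parent concept (feature 0) is active.
d_model, n_samples = 64, 10_000
dir_parent = torch.nn.functional.normalize(torch.randn(d_model), dim=0)
dir_child = torch.nn.functional.normalize(torch.randn(d_model), dim=0)

parent_on = torch.rand(n_samples) < 0.3
child_on = parent_on & (torch.rand(n_samples) < 0.5)  # child implies parent
acts = (parent_on.float()[:, None] * dir_parent
        + child_on.float()[:, None] * dir_child
        + 0.01 * torch.randn(n_samples, d_model))

# Absorption probe: fraction of parent-positive inputs on which the parent
# latent is silent (a false negative) while the child latent is active.
def absorption_rate(W_enc, b_enc, parent_idx, child_idx):
    z = torch.relu(acts @ W_enc + b_enc)  # (n_samples, n_latents)
    parent_silent = z[parent_on, parent_idx] == 0
    child_fires = z[parent_on, child_idx] > 0
    return (parent_silent & child_fires).float().mean().item()

n_latents = 256
W_enc = torch.randn(d_model, n_latents) * 0.1  # placeholder for trained weights
b_enc = torch.zeros(n_latents)
print("absorption rate:", absorption_rate(W_enc, b_enc, parent_idx=0, child_idx=1))
```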
SAEs assume concepts are independent and stationary over time, but actual LM activations exhibit strong temporal correlations and non-stationarity. The SAE's temporal-independence and fixed-sparsity assumptions lead to bottlenecks such as SAE Feature Splitting.

Temporal Feature Analysis (TFA) decomposes activations into predictable (slow, contextual) components and novel (fast, residual) components. It outperforms SAEs on garden-path sentence parsing, event boundary detection, and capturing long-range structure. In other words, interpretability tools require an Inductive Bias aligned with the temporal structure of the data.
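One simple way to realize such a predictable/novel split, as a sketch (not necessarily the actual TFA procedure): fit a one-step linear predictor of the activation at position t from the activation at position t-1, treat the prediction as the slow, contextual component and the residual as the novel component. The toy random-walk signal below stands in for real LM activations.

```python
import torch

torch.manual_seed(0)

# Toy non-stationary, temporally correlated signal standing in for one
# sequence of residual-stream activations (seq_len x d_model).
seq_len, d_model = 512, 64
acts = torch.randn(seq_len, d_model).cumsum(dim=0)

# Fit a least-squares linear map A so that acts[t] ≈ acts[t-1] @ A.
X, Y = acts[:-1], acts[1:]
A = torch.linalg.lstsq(X, Y).solution

predictable = X @ A        # slow, contextual component
novel = Y - predictable    # fast, residual (novel) component

# For strongly temporally correlated activations, most of the variance
# should be carried by the predictable component.
print("variance explained by the predictable component:",
      1 - novel.var().item() / Y.var().item())
```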

Recommendations