Superposition Hypothesis

Creator: Seonglae Cho
Created: 2024 Apr 2 4:45
Edited: 2025 Mar 3 15:29

The superposition hypothesis postulates that neural networks “want to represent more features than they have neurons”.

This is consistent with the fact that the brain does not activate all at once: only some neurons fire at any given time, which prevents interference between superposed functions. The myth that we use only 10% of our brain's capacity arose before superposition was understood; in reality, activating everything at once would render the representation meaningless.

Cross entropy loss verification

We can calculate the cross-entropy loss achieved by this model in a few cases:
  1. Suppose the neuron only fires on feature A, and correctly predicts token A when it does. The model ignores all of the other features, predicting a uniform distribution over tokens B/C/D when feature A is not present. In this case (assuming the four features are equally likely) the loss is (3/4)·ln 3 ≈ 0.82 nats.
  2. Instead suppose that the neuron fires on both features A and B, predicting a uniform distribution over the A and B tokens. When the A and B features are not present, the model predicts a uniform distribution over the C and D tokens. In this case the loss is ln 2 ≈ 0.69 nats.
Models trained on cross-entropy loss will generally prefer to represent more features polysemantically rather than represent fewer monosemantically, even in cases where sparsity constraints make superposition impossible. Models trained on other loss functions do not necessarily have this problem. This is why
Neuron SAE
uses an MSE reconstruction loss together with a sparsity loss to learn a monosemantic latent dictionary.
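The two cross-entropy losses above can be checked directly. This is a sketch under the assumption (not stated in the original) that the four features A–D are equally likely and mutually exclusive:

```python
import math

# Case 1: monosemantic neuron. When A is present (prob 1/4) the model
# predicts token A with certainty (loss 0); otherwise (prob 3/4) it
# predicts a uniform distribution over B/C/D (loss ln 3 each time).
loss_mono = 0.25 * 0.0 + 0.75 * math.log(3)

# Case 2: polysemantic neuron. The model always predicts a uniform
# distribution over two tokens ({A,B} or {C,D}), so the loss is ln 2.
loss_poly = math.log(2)

print(f"monosemantic: {loss_mono:.3f} nats")  # ≈ 0.824
print(f"polysemantic: {loss_poly:.3f} nats")  # ≈ 0.693
```

Since ln 2 < (3/4)·ln 3, the polysemantic configuration wins under cross-entropy, exactly as the section argues.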

Compression

Because the loss is lower in case (2) than in case (1), the model achieves better performance by making its sole neuron polysemantic, even though there is no superposition.
notion image
The feature activations should be sparse, because sparsity is what enables this kind of noisy simulation. (
Compressed sensing
)
When a toy model is trained with 2 features and 1 ReLU neuron, the two features are arranged antipodally. As expected, sparsity is necessary for superposition to occur.
notion image
notion image
Associative memory, where a cue brings related concepts to mind together, can be interpreted as a form of superposition. The backup neurons encouraged by
Dropout
offer another interpretation.

Architectural approach

Several different approaches to addressing superposition have been highlighted. One of them is to engineer models that simply do not have superposition in the first place.
notion image
It is true that 10 dimensions admit only 10 orthogonal basis vectors, but in fact, as shown above, 5 nearly-orthogonal directions can be packed into just 2 dimensions, at the cost of some interference.
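The interference can be quantified. Five unit vectors at pentagon angles in 2D have pairwise dot products of cos 72° ≈ 0.31 and cos 144° ≈ −0.81 instead of 0 (a quick check, not from the original source):

```python
import numpy as np

angles = 2 * np.pi * np.arange(5) / 5            # pentagon directions in 2D
V = np.stack([np.cos(angles), np.sin(angles)])   # shape (2, 5), unit columns
gram = V.T @ V                                   # all pairwise dot products

off_diag = gram[~np.eye(5, dtype=bool)]          # interference terms only
print(f"max interference: {off_diag.max():.3f}")  # cos 72°  ≈ 0.309
print(f"min interference: {off_diag.min():.3f}")  # cos 144° ≈ -0.809
```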
But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features.
In this paper, Anthropic uses toy models — small ReLU networks trained on synthetic data with sparse input features. When features are sparse, superposition allows compression beyond what a linear model would do, at the cost of "interference" that requires nonlinear filtering.
  • Superposition is a real, observed phenomenon.
  • Both monosemantic and polysemantic neurons can form.
  • At least some kinds of computation can be performed in superposition.
  • Whether features are stored in superposition is governed by a phase change.
  • Superposition organizes features into geometric structures.
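The toy-model setup described above can be sketched in a few lines. This is a hedged reimplementation of the recipe (sparse synthetic features, tied weights, ReLU output, MSE loss), not Anthropic's code; the hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_hid, sparsity, lr = 5, 2, 0.9, 0.05
W = rng.normal(0.0, 0.1, (n_hid, n_feat))
b = np.zeros(n_feat)

def batch(n=256):
    # each feature is zero with probability `sparsity`, else uniform in [0, 1]
    mask = rng.uniform(size=(n, n_feat)) > sparsity
    return rng.uniform(0.0, 1.0, (n, n_feat)) * mask

def forward(x):
    h = x @ W.T                    # compress n_feat features into n_hid dims
    pre = h @ W + b                # decompress with the tied (transposed) weights
    return h, pre, np.maximum(pre, 0.0)

x0 = batch()
loss_init = np.mean((forward(x0)[2] - x0) ** 2)

for _ in range(3000):
    x = batch()
    h, pre, x_hat = forward(x)
    g = 2.0 * (x_hat - x) * (pre > 0) / x.shape[0]  # dL/d(pre-activation)
    grad_W = h.T @ g + (g @ W.T).T @ x              # W appears twice (tied weights)
    W -= lr * grad_W
    b -= lr * g.sum(axis=0)

loss_final = np.mean((forward(x0)[2] - x0) ** 2)
print(loss_final < loss_init)  # training reduces reconstruction loss
```

With 5 sparse features squeezed into 2 dimensions, the trained columns of W tend to spread into the kind of geometric arrangements the bullet points describe.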

Phase change

notion image
If we make the features sufficiently sparse, there is a phase change: the configuration collapses from a pentagon to a pair of digons, with the sparser feature forced to zero. The phase change occurs where the loss curves of the two geometries cross over.
A more complicated form of non-uniform superposition occurs when there are correlations between features. This seems essential for understanding superposition in the real world, where many features are correlated or anti-correlated.
notion image

Phase change for
In-context learning

Induction heads may be the mechanistic source of general in-context learning in transformer models of any size.
This phase change occurs early in training for language models of every size (provided they have more than one layer) and is visible as a bump in the training loss. During this phase change, the majority of in-context learning ability (as measured by the difference in loss between tokens early and late in the sequence) is acquired, and simultaneously induction heads form within the model that are capable of implementing fairly abstract and fuzzy versions of pattern completion.
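The in-context learning measurement mentioned above can be made concrete. The Induction Heads paper scores it as the loss of the 500th token minus the loss of the 50th token; the sketch below assumes that definition, and the per-token losses are synthetic:

```python
import numpy as np

def icl_score(token_losses):
    """In-context learning score: average loss at token 500 minus loss at
    token 50. More negative means the model benefits more from context."""
    losses = np.asarray(token_losses)          # shape (n_seqs, seq_len)
    return float(np.mean(losses[:, 499] - losses[:, 49]))

# toy per-token losses that decay with position, mimicking a model
# whose predictions improve as more context accumulates
pos = np.arange(512)
fake_losses = 3.0 + 2.0 * np.exp(-pos / 100.0)[None, :].repeat(4, axis=0)
print(round(icl_score(fake_losses), 3))        # negative: context helps
```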
Is there a way we could understand what "fraction of a dimension" a specific feature gets?
notion image
Perhaps the most striking phenomenon Anthropic has noticed is that the learning dynamics of toy models with large numbers of features appear to be dominated by "energy level jumps", where features jump between different feature dimensionalities.
notion image

Interference in
Residual Stream

Compressing many small neural networks into one
  • Read-in interference
  • Read-out interference
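A toy illustration of this interference, assuming (hypothetically) that two subnetworks share the residual stream through random, hence non-orthogonal, read/write directions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                          # residual stream width (illustrative)

# two subnetworks write to and read from a shared residual stream
write_a, write_b = rng.normal(size=(2, d))
read_a = write_a / np.dot(write_a, write_a)    # recovers a's signal exactly if alone

residual = 1.0 * write_a + 1.0 * write_b       # both subnetworks write signal 1.0

# read-in interference for subnetwork a (equivalently, read-out
# interference from b): b's write direction leaks into a's read
leak = read_a @ write_b
recovered = read_a @ residual                  # = 1.0 + leak, not the clean 1.0
print(recovered, leak)
```

With random directions the leak shrinks as d grows (roughly like 1/√d), which is part of why wide residual streams can host many superposed subnetworks.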
 
 
 
Thought vector (2016,
Gabriel Goh
)
2018 Superposition
2021

2022 Amazing works

(2023,
Chris Olah
) Distributed Representations: Composition & Superposition
Composition and superposition are opposing representational strategies, and it is important to distinguish between them (a tradeoff). Composition-based representations excel at interpretability and generalization, while superposition-based representations are more space-efficient.
2024
Interference in superposition and how to deal with that

Contradict 2024

Traditional interpretations assume features combine linearly, but in reality the positions of the feature vectors play a crucial role: days of the week and months are arranged in a circular pattern, and position embeddings form a helix. While superposition itself is well established, whether it operates as a pure linear combination remains unclear.
The core argument is that how feature vectors are positioned relative to one another, in distance, direction, or pattern, provides additional meaning in how the model processes information; the counterargument is that this "positional" structure, which results from selecting a specific basis, is not essential.
 
 
 
