Parameter Interpretability
Weights are a vector in parameter space. Attribution measures the effect of a weight, while a feature measures the effect of a representation. The motivation for weight-similarity metrics is to avoid components sharing parameters.
- SVD cannot handle features in superposition: it recovers at most as many orthogonal components as the space has dimensions
- NMF is likewise limited under the Superposition Hypothesis
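A toy sketch of why SVD falls short (assumed setup, for illustration only): store more sparse features than dimensions and check how many components SVD can recover.

```python
import numpy as np

# Toy sketch (assumed setup): 5 sparse features stored in superposition
# inside a 3-dimensional activation space. SVD recovers at most
# rank(X) <= 3 orthogonal components, so it cannot separate all 5 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))                       # 5 feature directions in R^3
codes = (rng.random(size=(100, 5)) < 0.1).astype(float)  # sparse activations
X = codes @ features                                     # observed activations

S = np.linalg.svd(X, compute_uv=False)
print(int(np.sum(S > 1e-8)))  # → 3: capped by the 3 dims, not the 5 features
```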
Weight Interpretability Notion
Weight Interpretability Methods
Bilinear MLPs
https://arxiv.org/pdf/2410.08417
Achille and Soatto (2018) studied the amount of information stored in the weights of deep networks.
Emergence of Invariance and Disentanglement in Deep Representations
Using established principles from Statistics and Information Theory, we show that invariance to nuisance factors in a deep neural network is equivalent to information minimality of the learned...
https://arxiv.org/abs/1706.01350

There is little superposition in parameter space. Linearity in parameter space is a reasonable assumption.
https://arxiv.org/pdf/2505.15811
https://arxiv.org/pdf/1804.08838
Transformers contain a core subnetwork with very few parameters (≈10 million) that nearly perfectly performs bigram (previous-token-only) next-token prediction, achieving r > 0.95 bigram reproduction even in models up to 1B parameters. These parameters, concentrated primarily in the first MLP layer, are essential to model performance: ablating them causes performance to collapse dramatically.
The first layer induces a sharp rotation from current-token space to next-token space: it simply reorients activations from a coordinate system describing the current token into one describing the next token. This serves as a minimal starting point for complex circuit analysis (a minimal circuit).
https://arxiv.org/pdf/2504.15471v2
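A minimal sketch of that reorientation (toy setup, not the paper's model): a single linear map, fit by least squares, that rotates each token's embedding onto its bigram successor's embedding.

```python
import numpy as np

# Toy sketch (assumed setup, not the paper's model): one linear map W that
# sends current-token embeddings to next-token embeddings, i.e. a bigram
# predictor implemented as a single layer.
rng = np.random.default_rng(0)
vocab, dim = 50, 64
E = rng.normal(size=(vocab, dim))              # token embeddings
next_tok = rng.integers(0, vocab, size=vocab)  # bigram table: t -> next(t)

W, *_ = np.linalg.lstsq(E, E[next_tok], rcond=None)  # solve E @ W ≈ E[next_tok]
pred = np.argmax(E @ W @ E.T, axis=1)          # decode to the nearest embedding
print((pred == next_tok).mean())               # bigram reproduction rate → 1.0
```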
Circuit-level reverse engineering
The ML puzzle released by Jane Street asked participants to interpret what a neural network with fully disclosed weights actually computes. At first glance it appeared to be a strange network that almost always outputs 0, but analyzing the final layers revealed that it internally checks whether the input matches a specific 16-byte value. After extensive analysis, a participant named Alex discovered that the network was a hand-implemented computation: a circuit-style neural network that computes MD5 hashes. Initial attempts to brute-force it with SAT solvers and linear programs failed due to excessive complexity, but by observing the repeating layer structure and reasoning toward hash functions, Alex ultimately identified it as MD5.
A tip for neural network reverse engineering is to start from the last layer: look at the layer sizes and repeating structures, and consider what the periodic repetition patterns resemble.
Can you reverse engineer our neural network?
A lot of “capture-the-flag” style ML puzzles give you a black box neural net, and your job is to figure out what it does. When we were thinking of creating o...
https://blog.janestreet.com/can-you-reverse-engineer-our-neural-network/
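Following that tip, a first pass might simply tally parameter shapes to expose the repeating structure (sketch over hypothetical weights; MD5, for instance, runs 64 near-identical rounds):

```python
from collections import Counter
import numpy as np

# Sketch over a hypothetical weights dump: count parameter shapes to expose
# repeating structure. A block repeated 64x is a strong hint toward an
# iterated computation such as MD5's 64 rounds.
weights = {f"round{i}.weight": np.zeros((128, 128)) for i in range(64)}
weights["output.weight"] = np.zeros((16, 128))  # final check layer

counts = Counter(w.shape for w in weights.values())
print(counts.most_common())  # → [((128, 128), 64), ((16, 128), 1)]
```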


Seonglae Cho