Parameter Interpretability
Weights are a vector in parameter space. Attribution measures the effect of a weight, while a feature measures the effect of a representation. The motivation for weight-similarity metrics is to avoid components sharing parameters.
- SVD cannot handle features in superposition: it recovers at most as many orthogonal components as the space has dimensions
- NMF is likewise limited under the Superposition Hypothesis
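A toy sketch of why SVD falls short (assumed setup, for illustration only): store more sparse features than dimensions and check how many components SVD can recover.

```python
import numpy as np

# Toy sketch (assumed setup): 5 sparse features stored in superposition
# inside a 3-dimensional activation space. SVD recovers at most
# rank(X) <= 3 orthogonal components, so it cannot separate all 5 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))                       # 5 feature directions in R^3
codes = (rng.random(size=(100, 5)) < 0.1).astype(float)  # sparse activations
X = codes @ features                                     # observed activations

S = np.linalg.svd(X, compute_uv=False)
print(int(np.sum(S > 1e-8)))  # → 3: capped by the 3 dims, not the 5 features
```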
Weight Interpretability Notion
Weight Interpretability Methods
Bilinear MLPs
https://arxiv.org/pdf/2410.08417
Achille and Soatto (2018) studied the amount of information stored in the weights of deep networks.
Emergence of Invariance and Disentanglement in Deep Representations
Using established principles from Statistics and Information Theory, we show that invariance to nuisance factors in a deep neural network is equivalent to information minimality of the learned...
https://arxiv.org/abs/1706.01350

There is little superposition in parameter space. Linearity in parameter space is a reasonable assumption.
https://arxiv.org/pdf/2505.15811
https://arxiv.org/pdf/1804.08838
Transformers contain a core subnetwork with very few parameters (≈10 million) that nearly perfectly performs bigram (previous-token-only) next-token prediction, achieving r > 0.95 bigram reproduction even in models up to 1B parameters. These parameters, concentrated primarily in the first MLP layer, are essential to model performance: ablating them causes performance to collapse dramatically.
The first layer induces a sharp rotation from current-token space to next-token space: it simply reorients activations from a coordinate system describing the current token into one describing the next token. This serves as a minimal starting point for complex circuit analysis (a minimal circuit).
https://arxiv.org/pdf/2504.15471v2
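A minimal sketch of that reorientation (toy setup, not the paper's model): a single linear map, fit by least squares, that rotates each token's embedding onto its bigram successor's embedding.

```python
import numpy as np

# Toy sketch (assumed setup, not the paper's model): one linear map W that
# sends current-token embeddings to next-token embeddings, i.e. a bigram
# predictor implemented as a single layer.
rng = np.random.default_rng(0)
vocab, dim = 50, 64
E = rng.normal(size=(vocab, dim))              # token embeddings
next_tok = rng.integers(0, vocab, size=vocab)  # bigram table: t -> next(t)

W, *_ = np.linalg.lstsq(E, E[next_tok], rcond=None)  # solve E @ W ≈ E[next_tok]
pred = np.argmax(E @ W @ E.T, axis=1)          # decode to the nearest embedding
print((pred == next_tok).mean())               # bigram reproduction rate → 1.0
```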
Circuit-level reverse engineering
The ML puzzle released by Jane Street asked participants to interpret what a neural network with fully disclosed weights actually computes. At first glance it appeared to be a strange network that almost always outputs 0, but analyzing the final layers revealed that it internally checks whether the input matches a specific 16-byte value. After extensive analysis, a participant named Alex discovered that the network was a hand-implemented computation: a circuit-style neural network that computes MD5 hashes. Initial attempts to brute-force it with SAT solvers and linear programs failed due to excessive complexity, but by observing the repeating layer structure and reasoning toward hash functions, Alex ultimately identified it as MD5.
A tip for neural network reverse engineering is to start from the last layer: look at the layer sizes and repeating structures, and consider what the periodic repetition patterns resemble.
Can you reverse engineer our neural network?
A lot of “capture-the-flag” style ML puzzles give you a black box neural net, and your job is to figure out what it does. When we were thinking of creating o...
https://blog.janestreet.com/can-you-reverse-engineer-our-neural-network/
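Following that tip, a first pass might simply tally parameter shapes to expose the repeating structure (sketch over hypothetical weights; MD5, for instance, runs 64 near-identical rounds):

```python
from collections import Counter
import numpy as np

# Sketch over a hypothetical weights dump: count parameter shapes to expose
# repeating structure. A block repeated 64x is a strong hint toward an
# iterated computation such as MD5's 64 rounds.
weights = {f"round{i}.weight": np.zeros((128, 128)) for i in range(64)}
weights["output.weight"] = np.zeros((16, 128))  # final check layer

counts = Counter(w.shape for w in weights.values())
print(counts.most_common())  # → [((128, 128), 64), ((16, 128), 1)]
```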


Seonglae Cho