Attribution-based Parameter Decomposition
Minimizing Mechanistic Description Length to decompose neural network parameters into mechanistic components. APD directly decomposes a neural network’s parameters into components that are faithful to the parameters of the original network, require a minimal number of components to process any input, and are maximally simple.
Intended as a substitute for traditional matrix decomposition approaches.
Desirable properties:
- Faithfulness: The decomposition should identify a set of components that sum to the parameters of the original network.
- Minimality: The decomposition should use as few components as possible to replicate the network’s behavior on its training distribution.
- Simplicity: Components should each involve as little computational machinery as possible.
Superposition Hypothesis
- Right singular vectors align with the input activation directions that cause a parameter component to be active and thereby have downstream causal effects (updates align them with the component's inputs)
- Left singular vectors align with the gradient directions, i.e. the output directions along which the component's contribution has downstream causal effects (updates align them with the component's outputs)
- Each parameter component can be written as an outer product of its (un-normalized) left and right singular vectors; see the sketch below
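A minimal sketch (PyTorch, with made-up sizes) of parametrizing each parameter component of a single layer as a rank-one outer product of un-normalized left and right vectors; the names `U`, `V` and the dimensions are illustrative assumptions, not the paper's code:

```python
import torch

# Illustrative sizes: C components for one layer with d_in inputs and d_out outputs.
C, d_in, d_out = 8, 16, 32

# Un-normalized left (output-side) and right (input-side) singular vectors.
U = torch.randn(C, d_out, requires_grad=True)
V = torch.randn(C, d_in, requires_grad=True)

# Each component is the rank-one outer product U_c V_c^T, shaped like the layer's weights.
P = torch.einsum('co,ci->coi', U, V)   # shape (C, d_out, d_in)

# Summing all components gives the candidate reconstruction of the layer's weight matrix.
W_hat = P.sum(dim=0)                   # shape (d_out, d_in)
```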
Loss
We decompose the network’s parameters into a set of parameter components and directly optimize them to be faithful, minimal, and simple. APD can be understood as an instance of a broader class of ‘linear parameter decomposition’ methods.
We decompose a network’s parameters $W^{l}_{i,j}$, where $l$ indexes the network’s weight matrices and $i, j$ index their rows and columns, by defining a set of parameter components $P^{l}_{c,i,j}$, $c \in \{1, \dots, C\}$. Their sum is trained to minimize the MSE with respect to the target network’s parameters:

$$\mathcal{L}_{\text{faithfulness}} = \sum_{l,i,j} \Big( W^{l}_{i,j} - \sum_{c=1}^{C} P^{l}_{c,i,j} \Big)^{2}$$
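A hedged sketch of the faithfulness term for a single weight matrix, assuming the components are stored as one stacked tensor `P` (names and shapes are illustrative):

```python
import torch

C, d_out, d_in = 8, 32, 16
P = torch.randn(C, d_out, d_in, requires_grad=True)   # parameter components
W_target = torch.randn(d_out, d_in)                   # original network's weights

# Faithfulness: squared error between the target weights and the sum of components,
# summed over all parameter indices (and, in a full model, over all weight matrices).
L_faithfulness = ((W_target - P.sum(dim=0)) ** 2).sum()
```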
On each input $x$, we sum only the top-$k$ most attributed parameter components, yielding a new parameter vector $\kappa(x) = \sum_{c \in \text{top-}k} P_{c}$, and use it to perform a forward pass. We train the output of this top-$k$ forward pass to match the target network’s output by minimizing

$$\mathcal{L}_{\text{minimality}} = D\big(f(x, \theta^{*}),\ f(x, \kappa(x))\big),$$

where $D$ is some distance or divergence measure.
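A simplified sketch of attribution and the top-$k$ forward pass for a one-layer linear "network" with no biases. The attribution used here (gradient of each output projected onto the component, squared, averaged over outputs) is a rough stand-in for the paper's gradient-based attributions, and all names and sizes are assumptions:

```python
import torch
import torch.nn.functional as F

C, d_out, d_in, k = 8, 32, 16, 2
W_target = torch.randn(d_out, d_in)                   # target network's weights
P = torch.randn(C, d_out, d_in, requires_grad=True)   # parameter components
x = torch.randn(d_in)                                 # one input

# Attribution sketch: project d(output_o)/dW onto each component, square,
# and average over output dimensions.
W = W_target.clone().requires_grad_(True)
y = W @ x
attributions = torch.zeros(C)
for o in range(d_out):
    grad = torch.autograd.grad(y[o], W, retain_graph=True)[0]          # (d_out, d_in)
    attributions += ((grad.unsqueeze(0) * P).sum(dim=(1, 2)).detach() ** 2)
attributions /= d_out

# Keep only the top-k most attributed components and run a forward pass with them.
topk = attributions.topk(k).indices
kappa = P[topk].sum(dim=0)                            # kappa(x): top-k parameter vector
y_topk = kappa @ x

# Minimality: distance between the target output and the top-k output
# (MSE here; a KL divergence would be used for distribution-valued outputs).
L_minimality = F.mse_loss(y_topk, (W_target @ x).detach())
```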
The active components are further trained to be simple by penalizing

$$\mathcal{L}_{\text{simplicity}} = \sum_{c \in \text{top-}k} \sum_{l} \big\lVert P^{l}_{c} \big\rVert_{p}^{p} = \sum_{c \in \text{top-}k} \sum_{l,m} \lambda_{c,l,m}^{\,p},$$

where $\lambda_{c,l,m}$ are the singular values of parameter component $c$ in layer $l$. This is the Schatten-$p$ norm of each component’s weight matrices, raised to the power $p$.
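A sketch of the simplicity penalty for the active components of one layer, using singular values from an SVD; the value of `p` and the set of active indices are illustrative assumptions:

```python
import torch

C, d_out, d_in, p = 8, 32, 16, 0.9
P = torch.randn(C, d_out, d_in, requires_grad=True)   # parameter components
active = [1, 4]                                       # top-k indices from the attribution step

# Schatten-p penalty: sum of singular values raised to the power p,
# over the active components of this layer.
L_simplicity = sum((torch.linalg.svdvals(P[c]) ** p).sum() for c in active)
```

If a component is parametrized as a rank-one outer product $U_c V_c^\top$, its only non-zero singular value is $\lVert U_c \rVert_2 \, \lVert V_c \rVert_2$, so this penalty can be computed without an explicit SVD.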
The full loss used to train the parameter components combines these terms: we want a decomposition that approximately sums to the target parameters (faithfulness) while minimizing the mechanistic description length of the computation performed on each input (minimality and simplicity), so the three losses are optimized jointly as a weighted sum.
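A sketch of how such a weighted sum could look; the coefficient names `beta` and `alpha` and their values are placeholders, not the paper's settings:

```python
import torch

# Stand-ins for the three loss terms computed in the sketches above.
L_faithfulness = torch.tensor(0.02, requires_grad=True)
L_minimality = torch.tensor(0.10, requires_grad=True)
L_simplicity = torch.tensor(1.50, requires_grad=True)

beta, alpha = 1.0, 0.1                                # illustrative weights
L_total = L_faithfulness + beta * L_minimality + alpha * L_simplicity
L_total.backward()                                    # gradients flow into the components
```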