APD

Creator: Seonglae Cho
Created: 2025 Jan 28 14:3
Edited: 2025 Apr 14 21:22
Refs: MDL, MMCS

Attribution-based Parameter Decomposition

Minimizing Mechanistic Description Length (MDL) to decompose neural network parameters into mechanistic components. APD directly decomposes a neural network's parameters into components that are faithful to the parameters of the original network, of which only a minimal number are required to process any given input, and which are each maximally simple.

It aims to substitute traditional Matrix Decomposition.

Desirable properties:

  • Faithfulness: The decomposition should identify a set of components that sum to the parameters of the original network.
  • Minimality: The decomposition should use as few components as possible to replicate the network’s behavior on its training distribution.
  • Simplicity: Components should each involve as little computational machinery as possible.

Superposition Hypothesis

  • Right singular vectors align with the input activation directions that lead the parameter component to have downstream causal effects (they update to align with the component's inputs)
  • Left singular vectors are the output directions in which the component has downstream causal effects; they align with the gradients that attribute that component (they update to align with output gradients)
  • Parameter components can therefore be decomposed as an outer product of their (un-normed) left and right singular vectors, sketched below
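In symbols (a sketch of this picture; the notation here is assumed rather than quoted): writing a component's SVD as $P^{l,c} = U^{l,c} S^{l,c} (V^{l,c})^\top$, the component factors into rank-one outer products,

$$P^{l,c} = \sum_m \lambda^{l,c}_m \, u^{l,c}_m \, (v^{l,c}_m)^\top, \qquad P^{l,c} a = \sum_m \lambda^{l,c}_m \, u^{l,c}_m \, \langle v^{l,c}_m, a \rangle,$$

so each rank-one piece reads the input activations $a$ along its right singular vector $v^{l,c}_m$ and writes its output along its left singular vector $u^{l,c}_m$.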

Loss

We decompose the network's parameters into a set of parameter components and directly optimize them to be faithful, minimal, and simple. APD can be understood as an instance of a broader class of 'linear parameter decomposition' methods.
We decompose a network's parameters $\theta = \{W^l_{i,j}\}$, where $l$ indexes the network's weight matrices and $i, j$ index rows and columns, by defining a set of parameter components $\{P^{l,c}\}_{c=1}^{C}$. Their sum is trained to minimize the MSE with respect to the target network's parameters, $\theta^*$:

$$\mathcal{L}_{\text{faithfulness}} = \sum_{l,i,j} \Big( W^{l}_{i,j} - \sum_{c=1}^{C} P^{l,c}_{i,j} \Big)^2$$
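A minimal PyTorch sketch of this faithfulness term (function and variable names are illustrative, not the paper's code):

```python
import torch

def faithfulness_loss(target_weights, components):
    """MSE between the target network's weights and the sum of components.

    target_weights: dict of layer name -> weight tensor W^l, shape (d_out, d_in)
    components:     dict of layer name -> tensor P^{l,c}, shape (C, d_out, d_in)
    """
    loss = torch.tensor(0.0)
    for name, W in target_weights.items():
        # Components for this layer must sum to the target weight matrix.
        loss = loss + ((W - components[name].sum(dim=0)) ** 2).sum()
    return loss
```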
On each input $x$, we sum only the top-$k$ most attributed parameter components, yielding a new parameter vector $\kappa(x) = \sum_{c \in \text{top-}k(x)} P^{c}$, and use it to perform a forward pass. We train the output of the top-$k$ most attributed parameter components to match the target network's output by minimizing

$$\mathcal{L}_{\text{minimality}} = D\big(f(x; \theta^*),\, f(x; \kappa(x))\big),$$

where $D$ is some distance or divergence measure.
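A sketch of the top-$k$ selection. The attribution rule below (a plain gradient-times-component inner product) is a simplified stand-in for the paper's gradient-based attributions, and all names are assumptions:

```python
import torch

def topk_components(components, grads, k):
    """Rank components by |sum_{l,i,j} dF/dW^l_{i,j} * P^{l,c}_{i,j}| and
    sum the top-k into the parameters kappa(x) for the sparse forward pass.

    components: dict of layer name -> (C, d_out, d_in) component tensor
    grads:      dict of layer name -> (d_out, d_in) gradient of the target
                network's output w.r.t. that weight matrix, evaluated at x
                (e.g. obtained via torch.autograd.grad)
    """
    C = next(iter(components.values())).shape[0]
    attributions = torch.zeros(C)
    for name, P in components.items():
        # Inner product of each component with the weight gradient.
        attributions += (grads[name].unsqueeze(0) * P).flatten(1).sum(dim=1)
    active = attributions.abs().topk(k).indices
    kappa = {name: P[active].sum(dim=0) for name, P in components.items()}
    return kappa, active
```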
The active components are additionally penalized for complexity:

$$\mathcal{L}_{\text{simplicity}} = \sum_{c \in \text{top-}k(x)} \sum_{l} \sum_{m} \big(\lambda^{l,c}_m\big)^p,$$

where $\lambda^{l,c}_m$ are the singular values of parameter component $c$ in layer $l$. This is also known as the Schatten-$p$ norm.
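A sketch of this simplicity penalty (the exponent $p$ is a hyperparameter; the default below is an arbitrary placeholder):

```python
import torch

def simplicity_loss(components, active, p=0.9):
    """Schatten-p penalty: sum of p-th powers of the singular values of the
    active (top-k) components across all layers.

    components: dict of layer name -> (C, d_out, d_in) component tensor
    active:     1-D tensor of top-k component indices for this input
    """
    loss = torch.tensor(0.0)
    for P in components.values():
        sv = torch.linalg.svdvals(P[active])  # (k, min(d_out, d_in))
        loss = loss + (sv ** p).sum()
    return loss
```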
For the loss term that we use to train our parameter components, we want a decomposition that approximately sums to the target parameters and minimizes the MDL, so the three terms above are combined into a single training objective.
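One plausible form of the combined objective (the coefficient names $\alpha, \beta$ are assumptions; they are hyperparameters weighting the terms):

$$\mathcal{L}_{\text{APD}} = \mathcal{L}_{\text{faithfulness}} + \beta\,\mathcal{L}_{\text{minimality}} + \alpha\,\mathcal{L}_{\text{simplicity}}$$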