SPD (Stochastic Parameter Decomposition): an improved APD (Attribution-based Parameter Decomposition)

- Rank-1 subcomponents: each layer's weight matrix is decomposed into a sum of rank-1 components (outer products), representing the full matrix as many low-dimensional parts
- Probabilistic masking randomly removes "unnecessary" components for each input, while a reconstruction loss keeps the output unchanged
- A small MLP is trained to predict the "causal importance" of each subcomponent, with the objective encouraging as many components as possible to be deactivated
- With less hyperparameter tuning than APD, SPD accurately recovers the original mechanisms across a range of toy models (superposition, distributed representations, compressed computation, etc.); a minimal sketch of the pipeline follows this list
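
A minimal PyTorch sketch of the pipeline the bullets describe. The sizes, the names (`gate`, `layer_forward`), and the mask-sampling rule (uniform between the predicted importance and 1) are assumptions for illustration, not the paper's implementation:

```python
import torch

d_in, d_out, C = 16, 16, 32   # hypothetical layer width and subcomponent count

# Rank-1 subcomponents: W is represented as sum_c outer(u_c, v_c).
U = torch.nn.Parameter(0.1 * torch.randn(C, d_out))  # output directions u_c
V = torch.nn.Parameter(0.1 * torch.randn(C, d_in))   # input directions v_c

# Small MLP predicting each subcomponent's causal importance g_c in [0, 1].
gate = torch.nn.Sequential(
    torch.nn.Linear(d_in, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, C), torch.nn.Sigmoid(),
)

def layer_forward(x, mask):
    # Each subcomponent detects direction v_c in the input and writes along u_c.
    acts = x @ V.T            # (batch, C): v_c . x for each subcomponent
    return (acts * mask) @ U  # (batch, d_out): sum_c m_c * (v_c . x) * u_c

x = torch.randn(8, d_in)
g = gate(x)                   # predicted causal importance per input
# Assumed sampling rule: mask drawn uniformly from [g_c, 1], so components judged
# important (g ~ 1) stay fully on, unimportant ones (g ~ 0) are randomly ablated.
m = g + (1 - g) * torch.rand_like(g)
y = layer_forward(x, m)       # trained to match the unmasked output
```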
Traditional interpretability focused on neuron activation space has known limitations. This approach instead interprets parameter space itself, decomposing each weight matrix into a sum of rank-1 matrices (outer products), called subcomponents. Rank-1 is the minimal unit of computation: it detects a specific direction in the input and writes a signal along a specific direction in the output. The method assumes that only a few subcomponents are active for any given input, and verifies that ablating (removing) the unneeded subcomponents leaves the output unchanged.
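
Writing this out (the notation $u_c$, $v_c$ is chosen here for illustration):

$$
W = \sum_{c=1}^{C} u_c v_c^\top, \qquad Wx = \sum_{c=1}^{C} \left(v_c^\top x\right) u_c
$$

Each term reads how strongly direction $v_c$ is present in the input and writes that signal along direction $u_c$ in the output; ablating subcomponent $c$ simply drops its term from the sum.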
- Weight faithfulness: the decomposed components must sum to the original weights.
- Stochastic reconstruction: the output must be preserved even when unnecessary subcomponents are randomly removed for each input.
- Minimality: computation is encouraged to use as few subcomponents as possible (see the loss sketch after this list).
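
A self-contained sketch of how these three losses might be computed, continuing the notation above; the shapes, targets, and the 0.01 weighting are assumptions, not the paper's values:

```python
import torch

C, d_in, d_out = 32, 16, 16
W_target = torch.randn(d_out, d_in)                  # original pretrained weights
U = torch.nn.Parameter(0.1 * torch.randn(C, d_out))  # output directions u_c
V = torch.nn.Parameter(0.1 * torch.randn(C, d_in))   # input directions v_c
x = torch.randn(8, d_in)
g = torch.rand(8, C)                  # predicted causal importances (from the gate MLP)
m = g + (1 - g) * torch.rand_like(g)  # stochastic masks in [g_c, 1]

# (1) Weight faithfulness: subcomponents must sum to the original matrix.
faithfulness = (U.T @ V - W_target).pow(2).mean()

# (2) Stochastic reconstruction: the masked forward pass must match the
#     original model's output.
y_target = x @ W_target.T
y_masked = ((x @ V.T) * m) @ U
reconstruction = (y_masked - y_target).pow(2).mean()

# (3) Minimality: push predicted importances toward zero so that as few
#     subcomponents as possible remain active per input.
minimality = g.abs().mean()

loss = faithfulness + reconstruction + 0.01 * minimality  # hypothetical weighting
```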
While activation engineering intervenes dynamically at inference time, parameter decomposition enables permanent deletion or editing of specific knowledge by modifying the weights themselves.
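
For instance, once a subcomponent is identified with an unwanted mechanism, its rank-1 contribution can be subtracted from the weights outright; a minimal sketch with placeholder tensors:

```python
import torch

d_in, d_out = 16, 16
W = torch.randn(d_out, d_in)  # original layer weights
u_k = torch.randn(d_out)      # output direction of the subcomponent to delete
v_k = torch.randn(d_in)       # input direction it detects

# Permanent edit: remove the mechanism by subtracting its rank-1 term.
W_edited = W - torch.outer(u_k, v_k)
```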