SPD, APD++

- Rank-1 subcomponents decompose each layer's weights: the full weight matrix is represented as a sum of many low-dimensional components
- Probabilistic masking randomly ablates "unnecessary" subcomponents for each input, while a reconstruction loss keeps the output unchanged
- A small MLP is trained to predict the "causal importance" of each subcomponent, encouraging as many components as possible to be deactivated (see the sketch after this list)
- With less hyperparameter tuning than APD, it accurately recovers the original mechanisms in various toy models (superposition, distributed representations, compressed computation, etc.)
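
Below is a minimal PyTorch sketch of this setup for a single linear layer, assuming the mask is sampled uniformly between each predicted importance and 1. The class name `SPDLinear`, the tiny gate MLP, and all shapes are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of SPD-style stochastic masking for one linear layer.
import torch
import torch.nn as nn

class SPDLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_components: int):
        super().__init__()
        # Rank-1 subcomponents: W ≈ sum_c outer(u_c, v_c)
        self.U = nn.Parameter(torch.randn(n_components, d_out) * 0.02)
        self.V = nn.Parameter(torch.randn(n_components, d_in) * 0.02)
        # Small MLP predicting each subcomponent's causal importance g_c in [0, 1]
        self.gate = nn.Sequential(nn.Linear(1, 16), nn.GELU(), nn.Linear(16, 1))

    def forward(self, x: torch.Tensor):
        # Inner activation of each subcomponent: a_c = v_c . x  -> (batch, C)
        a = x @ self.V.T
        g = torch.sigmoid(self.gate(a.unsqueeze(-1))).squeeze(-1)  # (batch, C)
        # Stochastic mask m_c ~ U(g_c, 1): important components are always kept,
        # unimportant ones are randomly (partially) ablated
        m = g + (1.0 - g) * torch.rand_like(g)
        # Masked output: y = sum_c m_c * a_c * u_c
        y = (m * a) @ self.U
        return y, g

layer = SPDLinear(d_in=8, d_out=4, n_components=32)
y, g = layer(torch.randn(5, 8))
print(y.shape, g.shape)  # torch.Size([5, 4]) torch.Size([5, 32])
```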
Towards Scalable Parameter Decomposition
The most successful methods so far, like SAEs, have focused largely on the activations that flow through a model, rather than directly inspecting the weights that transform and guide the inputs into those flows. It's a bit like trying to understand a program by only looking at its runtime variables, but never its source code. Parameter decomposition offers a way to decompose a model’s parameters—the 'source code'—into components that reveal not only what the network computes, but how it computes it. Today, we're releasing a paper on Stochastic Parameter Decomposition (SPD), which removes key barriers to the scalability of prior methods.
https://www.goodfire.ai/papers/stochastic-param-decomp

Traditional interpretability focused on neuron activation space has limitations. This approach instead interprets parameter space directly, decomposing each weight matrix into a sum of rank-1 matrices (outer products), called subcomponents. A rank-1 matrix is the minimal computational unit that detects a specific direction in the input and writes a signal along a specific direction in the output. The method assumes that only a few subcomponents are active per input, and verifies that ablating (removing) the unnecessary subcomponents does not change the output.
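
Concretely (with illustrative notation), each weight matrix is written as a sum of rank-1 outer products:

$$
W \;=\; \sum_{c=1}^{C} u_c v_c^{\top}, \qquad u_c \in \mathbb{R}^{d_{\text{out}}},\; v_c \in \mathbb{R}^{d_{\text{in}}}
$$

so each subcomponent $u_c v_c^{\top}$ reads off the input direction $v_c$ (via $v_c^{\top} x$) and writes the result along the output direction $u_c$.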
- Weight faithfulness: the sum of the decomposed components must match the original weights.
- Stochastic reconstruction: the output is preserved even when unnecessary subcomponents are randomly ablated per input.
- Minimality: encourages computation to use as few subcomponents as possible (the three losses are sketched below).
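
Roughly, the three losses could look like this for the `SPDLinear` sketch above. The names `target_W`/`target_out` and the exponent `p` are assumptions; the paper applies per-layer variants and additional weighting.

```python
# Sketch of the three SPD training losses for the SPDLinear layer above.
# target_W is the frozen original weight matrix; target_out = x @ target_W.T.
import torch

def spd_losses(layer, x, target_W, target_out, p: float = 0.9):
    # 1. Weight faithfulness: subcomponents must sum to the original weights
    W_hat = layer.U.T @ layer.V                      # (d_out, d_in)
    l_faith = ((W_hat - target_W) ** 2).mean()

    # 2. Stochastic reconstruction: output preserved under random masks
    y_masked, g = layer(x)
    l_recon = ((y_masked - target_out) ** 2).mean()

    # 3. Importance minimality: push causal importances toward zero (p < 1)
    l_min = (g.abs() ** p).sum(dim=-1).mean()

    return l_faith, l_recon, l_min
```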
While activation engineering applies to dynamic, per-input situations, parameter decomposition enables permanent deletion or editing of specific knowledge, as in the sketch below.
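
For instance, once trained, deleting a subcomponent would amount to subtracting its rank-1 contribution from the weights. This continues the sketch above; the index `c` is hypothetical.

```python
# Hypothetical permanent edit: remove subcomponent c's rank-1 contribution.
c = 3  # index of the subcomponent to ablate (illustrative)
with torch.no_grad():
    W_edited = target_W - torch.outer(layer.U[c], layer.V[c])
```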
Making sense of parameter-space decomposition — LessWrong
https://www.lesswrong.com/posts/Wo22C8vhveDbDWhAc/making-sense-of-parameter-space-decomposition

Seonglae Cho