VPD

adVersarial Parameter Decomposition (VPD)

Activation-based interpretability methods such as SAEs or Transcoders have a fundamental limitation: because their functional form differs from the original model, precise model editing is difficult, and reliability can degrade in out-of-distribution (OOD) settings. To address this “functional form mismatch,” the authors reconstruct the model’s mechanisms directly in parameter space as a sum of rank-1 subcomponents, enabling analysis of the “code” the model actually uses for a given input.

The core idea of VPD is to decompose a weight matrix into a sum of rank-1 subcomponents, and then test how essential each component is to the original model’s behavior via adversarial masking. Concretely, they use the decomposition:

$$W = \sum_{c} u_c v_c^\top$$

Here $u_c$ and $v_c$ are vectors representing the output and input directions, and each subcomponent is assigned a causal importance value indicating how necessary it is for a particular input. Compared to Stochastic Parameter Decomposition (SPD), VPD goes further by performing adversarial training: using Projected Gradient Descent (PGD), it searches for a mask that most effectively disrupts the model's behavior, like Gradient Routing. This helps ensure the identified subcomponents are not merely correlated with activations, but are mechanistically faithful elements required to preserve the model's behavior.
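To make the masking idea concrete, here is a minimal NumPy sketch for a single linear layer. It is an illustration under stated assumptions, not the authors' implementation. The names `U`, `V`, `g`, `masked_output`, and `pgd_worst_mask` are hypothetical. The mask parameterization `g + (1 - g) * r` with `r` in [0, 1] (so a component with causal importance 1 is always kept, and one with importance 0 may be ablated freely) is an assumption, as are the squared-error loss and the sign-gradient step.

```python
import numpy as np

def masked_output(U, V, mask, x):
    """Apply (sum_c mask_c * u_c v_c^T) to x, where u_c, v_c are columns of U, V."""
    return U @ (mask * (V.T @ x))

def recon_loss(U, V, mask, x, y_target):
    """Squared error between the masked layer output and the target output."""
    d = masked_output(U, V, mask, x) - y_target
    return float(d @ d)

def pgd_worst_mask(U, V, g, x, steps=50, lr=0.1, seed=0):
    """Search for the mask in [g, 1] that most disrupts the layer output.

    g holds assumed causal importances in [0, 1], one per subcomponent.
    Returns the most damaging mask found and its loss.
    """
    rng = np.random.default_rng(seed)
    y_target = masked_output(U, V, np.ones_like(g), x)  # unmasked output
    a = V.T @ x                                         # input-direction activations
    r = rng.uniform(size=g.shape)                       # free part of the mask
    best_r = r.copy()
    best_loss = recon_loss(U, V, g + (1 - g) * r, x, y_target)
    for _ in range(steps):
        mask = g + (1 - g) * r
        d = U @ (mask * a) - y_target
        grad_r = 2.0 * a * (U.T @ d) * (1 - g)           # d(loss)/d(r), analytic
        r = np.clip(r + lr * np.sign(grad_r), 0.0, 1.0)  # ascent step, then project
        loss = recon_loss(U, V, g + (1 - g) * r, x, y_target)
        if loss > best_loss:
            best_r, best_loss = r.copy(), loss
    return g + (1 - g) * best_r, best_loss

# Toy usage: 6 hypothetical subcomponents for a 5 -> 4 linear layer.
rng = np.random.default_rng(1)
U, V, x = rng.normal(size=(4, 6)), rng.normal(size=(5, 6)), rng.normal(size=5)
g = np.array([1.0, 0.0, 1.0, 0.0, 0.5, 0.0])
mask, loss = pgd_worst_mask(U, V, g, x)
print("worst-case mask:", np.round(mask, 2), "loss:", round(loss, 4))
```

In this toy setting, a decomposition is only "safe" if the loss stays small even under the worst mask, which is what pushes the importance values toward marking every component the output really depends on.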

VPD jointly optimizes five losses (adversarial/stochastic reconstruction, importance/frequency minimality, and a -L2 penalty). In particular, the frequency minimality loss with a superlinear penalty suppresses feature splitting—a chronic issue in activation-based approaches—encouraging subcomponents to learn simple, general patterns. Notably, by decomposing attention layers holistically rather than splitting them head-by-head, VPD can naturally capture complex circuits distributed across multiple heads, such as “syntax-boundary routing” or “previous-token behavior.”

Empirically, VPD can recover 82.4% of the original model’s pretraining compute using only 205 active subcomponents per sequence position (≈2.1% density), strongly Pareto-dominating prior transcoders in the reconstruction–sparsity tradeoff. Moreover, even when subcomponent capacity is increased by 4×, the number of “alive” components stays around 10k, indicating that feature splitting remains rare.
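As a quick sanity check on these figures, the short calculation below relates the active count to the density. It assumes that "density" means active subcomponents divided by the total pool of alive subcomponents; the text does not state this definition explicitly, so treat the result as a rough consistency check only.

```python
# Figures quoted in the text above.
active_per_position = 205
density = 0.021  # approximately 2.1%

# Assumption: density = active / total alive subcomponents.
implied_total = active_per_position / density
print(f"implied subcomponent pool: about {implied_total:.0f}")
```

Under that assumption the implied pool is a little under ten thousand, which is in line with the roughly 10k alive components quoted above.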

Paper Summary: Interpreting Language Model Parameters

This post is a summary of our latest paper: Interpreting Language Model Parameters.

https://www.goodfire.ai/research/vpd-explainer
