VPD

Creator
Seonglae Cho
Created
2026 Jun 25 15:30
Edited
2026 Jun 25 18:27

adVersarial Parameter Decomposition (VPD)

Activation-based interpretability methods such as SAEs or Transcoders have a fundamental limitation: because their functional form differs from the original model, precise model editing is difficult, and reliability can degrade in out-of-distribution (OOD) settings. To address this “functional form mismatch,” the authors reconstruct the model’s mechanisms directly in parameter space as a sum of rank-1 subcomponents, enabling analysis of the “code” the model actually uses for a given input.
The core idea of VPD is to decompose a weight matrix into a sum of rank-1 subcomponents, and then test how essential each component is to the original model’s behavior via adversarial masking. Concretely, they use the decomposition:
$W \approx \sum_c u_c v_c^\top$
Here $u_c$ and $v_c$ are vectors representing the output and input directions, and each subcomponent is assigned a causal importance value indicating how necessary it is for a particular input. Compared to Stochastic Parameter Decomposition (SPD), VPD goes further by performing adversarial training: using Projected Gradient Descent (PGD), it searches for the mask that most effectively disrupts the model’s behavior (compare Gradient Routing). This helps ensure the identified subcomponents are not merely correlated with activations, but are mechanistically faithful elements required to preserve the model’s behavior.
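The adversarial masking step can be illustrated on a single linear layer. The following is a minimal sketch under stated assumptions, not the authors' implementation: the layer is written as a sum of rank-1 terms (output direction `u_c`, input direction `v_c`), each subcomponent gets a mask value in `[lower_c, 1]`, where `lower_c` stands in for its causal importance (this box constraint is an assumption borrowed from SPD-style masking), and a sign-gradient PGD loop searches for the mask that maximizes the squared output error. All function and variable names are hypothetical.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def masked_output(U, V, x, mask):
    """Apply the masked rank-1 sum to x: sum_c mask[c] * u_c * (v_c . x)."""
    out = [0.0] * len(U[0])
    for u_c, v_c, m_c in zip(U, V, mask):
        coeff = m_c * dot(v_c, x)
        for i, u_ci in enumerate(u_c):
            out[i] += coeff * u_ci
    return out

def recon_loss(U, V, x, mask):
    """Squared error between the masked output and the unmasked output."""
    full = masked_output(U, V, x, [1.0] * len(U))
    out = masked_output(U, V, x, mask)
    return sum((o - f) ** 2 for o, f in zip(out, full))

def pgd_adversarial_mask(U, V, x, lower, steps=25, step_size=0.1, seed=0):
    """Sign-gradient PGD: ascend the loss, then project back into [lower_c, 1]."""
    rng = random.Random(seed)
    full = masked_output(U, V, x, [1.0] * len(U))
    # Random start inside the box (the gradient vanishes at the all-ones mask).
    mask = [lo + (1.0 - lo) * rng.random() for lo in lower]
    for _ in range(steps):
        out = masked_output(U, V, x, mask)
        err = [o - f for o, f in zip(out, full)]
        for c, (u_c, v_c) in enumerate(zip(U, V)):
            # Analytic gradient of the squared error with respect to mask[c].
            grad = 2.0 * dot(err, u_c) * dot(v_c, x)
            step = step_size * ((grad > 0) - (grad < 0))
            mask[c] = min(1.0, max(lower[c], mask[c] + step))  # projection
    return mask

# Toy layer with three rank-1 subcomponents (2 inputs, 2 outputs).
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # output directions u_c
V = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # input directions v_c
x = [1.0, 2.0]
lower = [1.0, 0.0, 0.0]  # component 0 is treated as fully important

adv_mask = pgd_adversarial_mask(U, V, x, lower)
print("adversarial mask:", adv_mask)
print("worst-case loss found:", recon_loss(U, V, x, adv_mask))
```

In this toy setup a component whose lower bound is 1 cannot be ablated, so the adversary can only exploit components that were declared unimportant. A large worst-case loss therefore signals that some of those components were needed after all, which is the pressure adversarial training is meant to apply.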
VPD jointly optimizes five losses: adversarial reconstruction, stochastic reconstruction, importance minimality, frequency minimality, and a -L2 penalty. In particular, the frequency minimality loss uses a superlinear penalty to suppress feature splitting, a chronic issue in activation-based approaches, which encourages subcomponents to learn simple, general patterns. Notably, by decomposing attention layers holistically rather than splitting them head-by-head, VPD can naturally capture complex circuits distributed across multiple heads, such as “syntax-boundary routing” or “previous-token behavior.”
Empirically, VPD can recover 82.4% of the original model’s pretraining compute using only 205 active subcomponents per sequence position (≈2.1% density), strongly Pareto-dominating prior transcoders in the reconstruction–sparsity tradeoff. Moreover, even when subcomponent capacity is increased by 4×, the number of “alive” components stays around 10k, indicating that feature splitting remains rare.
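As a quick consistency check on these figures, the short calculation below assumes that "density" means active subcomponents divided by alive subcomponents (an interpretation, not something the summary states explicitly). Under that assumption, 205 active subcomponents at 2.1% density implies a pool of a little under 10k, in line with the reported alive count.

```python
active_per_position = 205   # active subcomponents per sequence position
density = 0.021             # reported density, approximately 2.1%

# Assumed relation: density = active / alive, so alive = active / density.
implied_alive = active_per_position / density
print(f"implied alive subcomponents: {implied_alive:.0f}")
```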
 
 
Paper Summary: Interpreting Language Model Parameters
This post is a summary of our latest paper: Interpreting Language Model Parameters.