GLP: Learning LLM activation distribution itself with a diffusion model
Activation generative model
- Collected 1 billion residual stream activations
- Trained a flow matching-based diffusion MLP on single-token activations
- In other words, a model that directly learns the distribution these layer activations naturally form
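The flow-matching objective described above can be sketched in a few lines. This is a minimal illustration with random vectors standing in for collected residual-stream activations, and a tiny untrained MLP as a placeholder for the velocity network; none of the names or sizes come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: random vectors in place of real residual-stream activations.
d = 16
acts = rng.normal(size=(128, d))           # batch of single-token activations

def velocity_mlp(x, t, W1, b1, W2, b2):
    """Tiny velocity-field MLP v_theta(x_t, t); weights are placeholders."""
    h = np.concatenate([x, t[:, None]], axis=1)
    h = np.maximum(h @ W1 + b1, 0.0)       # ReLU
    return h @ W2 + b2

W1 = rng.normal(size=(d + 1, 32)) * 0.1
b1 = np.zeros(32)
W2 = rng.normal(size=(32, d)) * 0.1
b2 = np.zeros(d)

# Conditional flow matching: interpolate noise -> data along straight paths
# and regress the model onto the constant velocity (x1 - x0).
x1 = acts
x0 = rng.normal(size=x1.shape)             # sample from the Gaussian prior
t = rng.uniform(size=len(x1))
xt = (1 - t)[:, None] * x0 + t[:, None] * x1
v_target = x1 - x0
v_pred = velocity_mlp(xt, t, W1, b1, W2, b2)
loss = np.mean((v_pred - v_target) ** 2)   # minimized over MLP weights in training
```

Training would minimize `loss` over the MLP weights with any standard optimizer; only the objective is shown here.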
Steering Improvement
- Activation steering breaks fluency when applied too strongly, as activations are pushed off-manifold
- GLP adds noise to steered activations and then denoises them, projecting off-manifold activations back onto the manifold
- This produces more natural outputs for the same concept strength in sentiment steering, SAE feature steering, and persona elicitation
Corrects steering with manifold regularization
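The noise-then-denoise correction can be illustrated on a toy "manifold". Here the data manifold is just a Gaussian cloud around a mean, and the denoiser is a simple shrinkage toward that mean; the real GLP denoiser is the trained flow-matching model, so every function and constant below is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: the "manifold" is a cloud around data_mean.
d = 8
data_mean = rng.normal(size=d)
act = data_mean + 0.1 * rng.normal(size=d)   # an on-manifold activation
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

def toy_denoise(x, strength=0.5):
    """Placeholder denoiser: pulls x back toward the data mean."""
    return (1 - strength) * x + strength * data_mean

alpha = 5.0                                  # aggressive steering strength
steered = act + alpha * concept_dir          # pushes the activation off-manifold

s = 0.3                                      # noise level added before denoising
noisy = np.sqrt(1 - s) * steered + np.sqrt(s) * rng.normal(size=d)
projected = toy_denoise(noisy)

# The corrected activation lies closer to the manifold (here: the mean)
# than the raw steered one, while keeping part of the concept shift.
dist_steered = np.linalg.norm(steered - data_mean)
dist_projected = np.linalg.norm(projected - data_mean)
```

The design point is that noise first blurs the off-manifold detail, and the denoiser then resolves the activation back onto high-density regions, which is what preserves fluency at a given steering strength.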
Also useful as an interpretability feature encoder
- GLP's internal hidden units (meta-neurons) tend to separate concepts better at the single-neuron level
- Achieved higher performance in 1-D probing than SAEs, raw layer outputs, and raw MLP neurons
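A 1-D probe, as used in the comparison above, just asks how well a single scalar unit separates a concept with a threshold. A minimal sketch on synthetic data, where one column plays the role of a concept-bearing meta-neuron (the data and setup are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: feature 0 carries the concept (class-shifted mean);
# in GLP this column would be one meta-neuron's activation.
n, d = 200, 10
labels = rng.integers(0, 2, size=n).astype(bool)
feats = rng.normal(size=(n, d))
feats[:, 0] += 2.0 * labels                 # concept signal in a single unit

def probe_1d(x, y):
    """1-D probe: best threshold accuracy achievable on one scalar feature."""
    order = np.sort(x)
    thresholds = (order[:-1] + order[1:]) / 2
    accs = [max(np.mean((x > t) == y), np.mean((x < t) == y))
            for t in thresholds]
    return max(accs)

scores = [probe_1d(feats[:, j], labels) for j in range(d)]
best = max(scores)
# The concept-bearing unit scores well above chance; the other units hover near 0.5.
```

In the paper's evaluation, running this kind of probe per unit and comparing the best single unit across feature spaces is what makes GLP's meta-neurons comparable against SAE latents and raw neurons.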
Additional key claims:
- GLP loss follows compute scaling laws
- As diffusion loss decreases, both steering quality and probing performance improve
- In other words, loss serves as a proxy for downstream utility
https://arxiv.org/pdf/2602.06964

Seonglae Cho