GLP: Learning LLM activation distribution itself with a diffusion model
Activation generative model
- Collected 1 billion residual stream activations
- Trained a flow matching-based diffusion MLP on single-token activations
- In other words, a model that directly learns the distribution these layer activations naturally form
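The flow-matching objective described above can be sketched in a few lines. This is a minimal illustration with random vectors standing in for collected residual-stream activations, and a tiny untrained MLP as a placeholder for the velocity network; none of the names or sizes come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: random vectors in place of real residual-stream activations.
d = 16
acts = rng.normal(size=(128, d))           # batch of single-token activations

def velocity_mlp(x, t, W1, b1, W2, b2):
    """Tiny velocity-field MLP v_theta(x_t, t); weights are placeholders."""
    h = np.concatenate([x, t[:, None]], axis=1)
    h = np.maximum(h @ W1 + b1, 0.0)       # ReLU
    return h @ W2 + b2

W1 = rng.normal(size=(d + 1, 32)) * 0.1
b1 = np.zeros(32)
W2 = rng.normal(size=(32, d)) * 0.1
b2 = np.zeros(d)

# Conditional flow matching: interpolate noise -> data along straight paths
# and regress the model onto the constant velocity (x1 - x0).
x1 = acts
x0 = rng.normal(size=x1.shape)             # sample from the Gaussian prior
t = rng.uniform(size=len(x1))
xt = (1 - t)[:, None] * x0 + t[:, None] * x1
v_target = x1 - x0
v_pred = velocity_mlp(xt, t, W1, b1, W2, b2)
loss = np.mean((v_pred - v_target) ** 2)   # minimized over MLP weights in training
```

Training would minimize `loss` over the MLP weights with any standard optimizer; only the objective is shown here.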
Steering Improvement
- Activation steering breaks fluency when applied too strongly, as activations are pushed off-manifold
- GLP adds noise to steered activations and then denoises them, projecting off-manifold activations back onto the manifold
- This produces more natural outputs for the same concept strength in sentiment steering, SAE feature steering, and persona elicitation
Corrects steering with manifold regularization
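The noise-then-denoise correction can be illustrated on a toy "manifold". Here the data manifold is just a Gaussian cloud around a mean, and the denoiser is a simple shrinkage toward that mean; the real GLP denoiser is the trained flow-matching model, so every function and constant below is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: the "manifold" is a cloud around data_mean.
d = 8
data_mean = rng.normal(size=d)
act = data_mean + 0.1 * rng.normal(size=d)   # an on-manifold activation
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

def toy_denoise(x, strength=0.5):
    """Placeholder denoiser: pulls x back toward the data mean."""
    return (1 - strength) * x + strength * data_mean

alpha = 5.0                                  # aggressive steering strength
steered = act + alpha * concept_dir          # pushes the activation off-manifold

s = 0.3                                      # noise level added before denoising
noisy = np.sqrt(1 - s) * steered + np.sqrt(s) * rng.normal(size=d)
projected = toy_denoise(noisy)

# The corrected activation lies closer to the manifold (here: the mean)
# than the raw steered one, while keeping part of the concept shift.
dist_steered = np.linalg.norm(steered - data_mean)
dist_projected = np.linalg.norm(projected - data_mean)
```

The design point is that noise first blurs the off-manifold detail, and the denoiser then resolves the activation back onto high-density regions, which is what preserves fluency at a given steering strength.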
Also useful as an interpretability feature encoder
- GLP's internal hidden units (meta-neurons) tend to separate concepts better at the single-neuron level
- Achieved higher performance in 1-D probing than SAEs, raw layer outputs, and raw MLP neurons
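A 1-D probe, as used in the comparison above, just asks how well a single scalar unit separates a concept with a threshold. A minimal sketch on synthetic data, where one column plays the role of a concept-bearing meta-neuron (the data and setup are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: feature 0 carries the concept (class-shifted mean);
# in GLP this column would be one meta-neuron's activation.
n, d = 200, 10
labels = rng.integers(0, 2, size=n).astype(bool)
feats = rng.normal(size=(n, d))
feats[:, 0] += 2.0 * labels                 # concept signal in a single unit

def probe_1d(x, y):
    """1-D probe: best threshold accuracy achievable on one scalar feature."""
    order = np.sort(x)
    thresholds = (order[:-1] + order[1:]) / 2
    accs = [max(np.mean((x > t) == y), np.mean((x < t) == y))
            for t in thresholds]
    return max(accs)

scores = [probe_1d(feats[:, j], labels) for j in range(d)]
best = max(scores)
# The concept-bearing unit scores well above chance; the other units hover near 0.5.
```

In the paper's evaluation, running this kind of probe per unit and comparing the best single unit across feature spaces is what makes GLP's meta-neurons comparable against SAE latents and raw neurons.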
Additional key claims:
- GLP loss follows compute scaling laws
- As diffusion loss decreases, both steering quality and probing performance improve
- In other words, loss serves as a proxy for downstream utility
https://arxiv.org/pdf/2602.06964

Seonglae Cho