DLA

Direct Logit Attribution

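Unlike sparse autoencoders (SAEs), which decompose representations into learned features, DLA is a component-level decomposition for logit-space attribution. It splits the final residual stream into the outputs that each component added to it (the embeddings and the attention/MLP outputs from each layer) and measures how much each of those outputs contributes directly to the logits.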
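A minimal sketch of the decomposition, assuming a toy model whose component outputs are random vectors and whose final LayerNorm has no learned gain or bias. The component names, sizes, and token indices below are hypothetical and not taken from any real model. The LayerNorm scale is computed once from the full residual stream and then held fixed, so the per-component contributions sum exactly to the logits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5

# Hypothetical outputs each component wrote to the final-position residual stream.
components = {
    "embed": rng.normal(size=d_model),
    "attn_0": rng.normal(size=d_model),
    "mlp_0": rng.normal(size=d_model),
    "attn_1": rng.normal(size=d_model),
    "mlp_1": rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, vocab))  # toy unembedding matrix

def center(x):
    # Mean-subtraction is linear, so it can be applied per component.
    return x - x.mean()

# The final residual stream is the sum of all component outputs.
resid_final = sum(components.values())

# LayerNorm scale from the full residual stream, then treated as a constant.
sigma = np.sqrt(np.mean(center(resid_final) ** 2) + 1e-5)

# Model logits (LayerNorm without gain/bias, followed by unembedding).
logits = (center(resid_final) / sigma) @ W_U

# Direct logit attribution: push each component through the same frozen map.
dla = {name: (center(out) / sigma) @ W_U for name, out in components.items()}

# Attribution of a logit difference between two hypothetical tokens.
tok_a, tok_b = 3, 1
for name, contrib in dla.items():
    print(f"{name:>7}: {contrib[tok_a] - contrib[tok_b]:+.3f}")
```

Reading off a logit difference rather than a single logit is a common choice, since it cancels contributions that shift all logits equally.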

Direct Logit Attribution

This post explains Direct Logit Attribution and Logit Lens: key tools in the initial mechanistic investigation of transformer behaviour.

https://loganthomson.com/Direct-Logit-Attribution/

Limitation: DLA can be misinterpreted as causal (it is not a causal abstraction).
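One way to see why a DLA score should not be read as a causal effect is a toy case where a component has zero direct contribution but a large downstream one. The sketch below is a hand-built illustration, not a real model: `head_a` and `mlp_b` are hypothetical components, there is no LayerNorm, and the "unembedding" is a single direction.

```python
import numpy as np

d_model = 4
# Unembedding direction of a hypothetical target token.
logit_dir = np.array([0.0, 0.0, 0.0, 1.0])

def head_a(resid):
    # Writes a feature into dimension 0, orthogonal to logit_dir.
    return np.array([2.0, 0.0, 0.0, 0.0])

def mlp_b(resid):
    # Reads dimension 0 and writes it into the logit direction.
    return resid[0] * logit_dir

def run(ablate_a=False):
    resid = np.zeros(d_model)
    out_a = np.zeros(d_model) if ablate_a else head_a(resid)
    resid = resid + out_a
    out_b = mlp_b(resid)
    resid = resid + out_b
    # Return the logit and each component's direct contribution to it.
    return resid @ logit_dir, out_a @ logit_dir, out_b @ logit_dir

logit, dla_a, dla_b = run()
logit_ablated, _, _ = run(ablate_a=True)

print("DLA for head_a:", dla_a)                        # direct path only
print("DLA for mlp_b: ", dla_b)
print("Logit drop when head_a is ablated:", logit - logit_ablated)
```

Here DLA assigns `head_a` no credit because its output is orthogonal to the logit direction, yet zero-ablating it removes the whole logit, since `mlp_b` depends on it. DLA only captures the direct path from a component to the logits.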


https://arxiv.org/pdf/2310.07325
