Direct Logit Attribution
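Unlike a sparse autoencoder (SAE), which decomposes a representation into learned features, Direct Logit Attribution (DLA) works at the component level. It splits the final residual stream into the outputs that each component added to it (embeddings, plus the attention and MLP outputs of each layer), then maps each output through the unembedding to measure its direct contribution to a logit.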
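A minimal NumPy sketch of this decomposition, not taken from the linked post. The component names, the dimensions, the random weights, and the helper names `direct_logit_attribution` and `final_logit` are all hypothetical. It assumes a final LayerNorm with no learned gain or bias and freezes its scale at the value computed from the full residual stream, which is a common simplification that makes the map from residual stream to logits linear.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 16, 50

# Hypothetical outputs that each component wrote to the residual stream
# at a single token position (toy random values, not a real model).
components = {
    "embed": rng.normal(size=d_model),
    "attn_0": rng.normal(size=d_model),
    "mlp_0": rng.normal(size=d_model),
    "attn_1": rng.normal(size=d_model),
    "mlp_1": rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, d_vocab))  # toy unembedding matrix


def frozen_ln_scale(resid, eps=1e-5):
    """Standard deviation used by the final LayerNorm on the full residual."""
    centered = resid - resid.mean()
    return np.sqrt(centered.var() + eps)


def final_logit(components, W_U, token):
    """Logit of `token` computed the usual way from the summed residual."""
    resid = sum(components.values())
    normed = (resid - resid.mean()) / frozen_ln_scale(resid)
    return float(normed @ W_U[:, token])


def direct_logit_attribution(components, W_U, token):
    """Per-component direct contribution to the logit of `token`.

    Assumption: the LayerNorm scale is frozen at its value on the full
    residual stream, so centering, scaling and unembedding are all linear
    and the per-component terms sum to the full logit.
    """
    scale = frozen_ln_scale(sum(components.values()))
    return {
        name: float(((out - out.mean()) / scale) @ W_U[:, token])
        for name, out in components.items()
    }


token = 7
dla = direct_logit_attribution(components, W_U, token)
total = final_logit(components, W_U, token)

for name, value in dla.items():
    print(f"{name:>7}: {value:+.3f}")
print(f"  total: {total:+.3f}")
```

Because every step after freezing the scale is linear, the per-component values add up to the full logit, which is what lets each value be read as that component's share. In practice the attribution is often taken on a logit difference between two tokens rather than a single logit.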
Direct Logit Attribution
This post explains Direct Logit Attribution and Logit Lens: key tools in the initial mechanistic investigation of transformer behaviour.
https://loganthomson.com/Direct-Logit-Attribution/
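Limitation: DLA is easily misread as causal. It measures only what a component writes directly toward the logits, not the effects routed through later components, and it is not a Causal abstraction.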
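A toy illustration of that gap, under loudly stated assumptions: a hypothetical three-dimensional linear "model" with no LayerNorm, where component A writes into a direction the unembedding ignores and a later component B reads that direction. The weights and the `run` helper are invented for this sketch.

```python
import numpy as np

u = np.array([1.0, 0.0, 0.0])  # unembedding direction for the target token
x = np.array([0.0, 0.0, 1.0])  # embedding at this position

# Hypothetical component A: copies dimension 2 into dimension 1.
W_A = np.array([[0.0, 0.0, 0.0],
                [0.0, 0.0, 1.0],
                [0.0, 0.0, 0.0]])
# Hypothetical later component B: reads dimension 1, writes 2x into dimension 0.
W_B = np.array([[0.0, 2.0, 0.0],
                [0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])


def run(ablate_a=False):
    """Forward pass; optionally zero-ablate component A's output."""
    a = np.zeros(3) if ablate_a else W_A @ x
    b = W_B @ (x + a)  # B reads the residual stream after A has written
    logit = float((x + a + b) @ u)
    return a, b, logit


a, b, logit = run()
_, _, logit_without_a = run(ablate_a=True)

dla_a = float(a @ u)  # direct contribution of A to the logit
dla_b = float(b @ u)  # direct contribution of B to the logit
ablation_effect_a = logit - logit_without_a  # total effect of removing A

print("DLA of A:", dla_a)
print("DLA of B:", dla_b)
print("Effect of ablating A:", ablation_effect_a)
```

Here A receives a direct attribution of 0.0 while ablating it changes the logit by 2.0, because all of A's influence flows through B. Interventions such as ablation or activation patching are needed to measure that indirect path.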

Seonglae Cho