DLA

Creator
Seonglae Cho
Created
2026 Jan 9 15:56
Edited
2026 Jan 9 15:58
Refs

Direct Logit Attribution

Unlike SAEs, which decompose representations into learned features, Direct Logit Attribution is a component-level decomposition for logit-space attribution: it breaks the final residual stream into the component outputs that are added to it (token embeddings and the attention/MLP outputs of each layer) and measures each component's direct contribution to the logits.
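Because the unembedding is linear, the logits decompose additively over the residual-stream components. A minimal NumPy sketch of this idea, using hypothetical toy shapes and random matrices (and ignoring the final LayerNorm, which in practice is folded in or applied with frozen scale):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 8, 5    # toy dimensions (hypothetical)
n_components = 4         # e.g. embedding + per-layer attn/MLP outputs

# Hypothetical per-component contributions to the final residual stream.
component_outputs = rng.normal(size=(n_components, d_model))

# By construction, the final residual stream is the sum of component outputs.
resid_final = component_outputs.sum(axis=0)

# Hypothetical unembedding matrix mapping residual stream -> logits.
W_U = rng.normal(size=(d_model, vocab))

# Linearity: projecting each component separately and summing
# gives the same logits as projecting the summed residual stream.
logits = resid_final @ W_U
per_component_logits = component_outputs @ W_U  # (n_components, vocab)
assert np.allclose(per_component_logits.sum(axis=0), logits)

# DLA for a target token: each component's direct contribution to its logit.
target = 2
dla = per_component_logits[:, target]
```

In practice, libraries such as TransformerLens cache per-component residual contributions so this decomposition can be read off a real model rather than toy arrays.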
 
 
 
Direct Logit Attribution
This post explains Direct Logit Attribution and Logit Lens: key tools in the initial mechanistic investigation of transformer behaviour.
Limitation: DLA can be misinterpreted as causal (it is not a
Causal abstraction
)
 
