DLA

Creator
Seonglae Cho
Created
2026 Jan 9 15:56
Editor
Edited
2026 Jan 9 15:58
Refs

Direct Logit Attribution

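Unlike an SAE, which decomposes representations, DLA is a component-level activation decomposition for logit-space attribution. It breaks the final residual stream down into the component outputs that were added to it (the embeddings and the attention/MLP outputs from each layer) and projects each one into logit space, so that every component receives its own direct contribution to the output logits.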
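
The decomposition can be illustrated with a minimal sketch in pure Python. Everything in it is a hypothetical toy: the component names (`embed`, `attn_0`, `mlp_0`), the vectors, and the unembedding matrix `W_U` are made up, and the final LayerNorm is omitted for simplicity (in a real model it is usually handled by freezing its scale before projecting). Under those assumptions the logits are linear in the residual stream, so the per-component attributions sum exactly to the logits.

```python
# Toy direct logit attribution (DLA) in pure Python.
# All names and numbers are hypothetical; the final LayerNorm is omitted.

def matvec(vec, mat):
    """Row vector (d_model) times matrix (d_model x vocab) -> logits (vocab)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

# Output each component added to the residual stream at the final position.
components = {
    "embed":  [1.0, 0.0, 0.5, 0.0],
    "attn_0": [0.0, 2.0, 0.0, -1.0],
    "mlp_0":  [0.5, -1.0, 1.0, 0.0],
}

# Unembedding matrix W_U with shape (d_model=4, vocab=3).
W_U = [
    [1.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 1.0],
    [0.0, -1.0, 1.0],
]

# The final residual stream is the sum of all component outputs.
d_model = len(W_U)
final_resid = [sum(vec[i] for vec in components.values()) for i in range(d_model)]
logits = matvec(final_resid, W_U)

# DLA: project each component's output through W_U on its own.
dla = {name: matvec(vec, W_U) for name, vec in components.items()}

# Because the projection is linear, the attributions add back up to the logits.
vocab = len(W_U[0])
recomposed = [sum(attr[j] for attr in dla.values()) for j in range(vocab)]

for name, attr in dla.items():
    print(name, attr)
print("logits    ", logits)
print("recomposed", recomposed)
```

In practice the attribution is often read off for a single token, or for a logit difference between two tokens, rather than for the whole vocabulary.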
 
 
 
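Limitation: DLA can be misinterpreted as causal (it is not a Causal abstraction). It captures only the direct path from each component to the logits, not effects that are mediated by later components.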
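
A small toy case shows why the direct attribution should not be read as a causal effect. The two components `A` and `B`, their weights, and the single-logit unembedding `w_u` below are all hypothetical. `A` writes only to a residual dimension that the unembedding ignores, but `B` reads that dimension and writes to the one that matters, so `A` has zero DLA and yet ablating it changes the logit.

```python
# Toy contrast between DLA and an ablation (all values hypothetical).

w_u = [0.0, 1.0]  # the single output logit reads residual dimension 1 only

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def run(ablate_a=False):
    """Return (logit, DLA of A, DLA of B) for a two-component toy model."""
    a_out = [0.0, 0.0] if ablate_a else [1.0, 0.0]  # A writes to dimension 0
    resid = list(a_out)
    b_out = [0.0, 2.0 * resid[0]]                   # B reads dim 0, writes dim 1
    resid = [r + b for r, b in zip(resid, b_out)]
    return dot(resid, w_u), dot(a_out, w_u), dot(b_out, w_u)

clean_logit, dla_a, dla_b = run()
ablated_logit, _, _ = run(ablate_a=True)

print("DLA of A:", dla_a)
print("DLA of B:", dla_b)
print("effect of ablating A:", clean_logit - ablated_logit)
```

Here DLA credits `B` with the whole logit, while the ablation shows that the logit depends entirely on `A` as well.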
 
