Distributed alignment search
DAS learns an orthogonal matrix that rotates a model's activation space, and uses interchange interventions to find an alignment between rotated subspaces and the variables of a high-level causal model, optimizing the rotation so the alignment holds. It focuses on distributed representations, in contrast to SAEs, which try to decompose activations into monosemantic features.
Rotating the basis of the activation vectors lets DAS pick out high-level causal variables, but because the rotation preserves the dimensionality, the Superposition Hypothesis implies a limit: if there are more features than dimensions, not every variable can be isolated in its own orthogonal subspace.
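A minimal sketch of the core idea, assuming a PyTorch setup: learn an orthogonal rotation of one hidden layer and perform a distributed interchange intervention on the first k rotated coordinates. The class name, the choice of layer, and the subspace size k are illustrative assumptions, not the paper's actual code.

```python
# Illustrative sketch of a DAS-style intervention module (hypothetical names).
import torch
import torch.nn as nn

class RotatedSubspaceIntervention(nn.Module):
    """Learn an orthogonal rotation R of a hidden layer and intervene on the
    first `k` rotated coordinates, hypothesized to encode one high-level
    causal variable."""
    def __init__(self, hidden_dim: int, k: int):
        super().__init__()
        # Parametrization keeps the weight orthogonal throughout training.
        self.rotation = nn.utils.parametrizations.orthogonal(
            nn.Linear(hidden_dim, hidden_dim, bias=False)
        )
        self.k = k

    def forward(self, base_act: torch.Tensor, source_act: torch.Tensor) -> torch.Tensor:
        R = self.rotation.weight
        # Rotate both the base-run and source-run activations into the learned basis.
        base_rot = base_act @ R.T
        source_rot = source_act @ R.T
        # Distributed interchange intervention: swap the first k rotated
        # coordinates of the base run with those from the source run.
        mixed = torch.cat([source_rot[..., :self.k], base_rot[..., self.k:]], dim=-1)
        # Rotate back to the original basis and continue the model's forward pass.
        return mixed @ R
```

In training, only the rotation is optimized: the intervened low-level model's output is pushed to match the counterfactual output of the high-level causal model under the corresponding variable swap, so a good alignment means the rotated subspace behaves like that causal variable.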
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Causal abstraction is a promising theoretical framework for explainable artificial intelligence that defines when an interpretable high-level causal model is...
https://proceedings.mlr.press/v236/geiger24a.html

Seonglae Cho