Non‑linear Representation Dilemma

The dilemma: if we allow the assumption that representations can be encoded non-linearly, causal abstraction by itself becomes vacuous as an interpretability tool, and without additional assumptions such as the Linear Representation Hypothesis (which enforces a linear alignment), mechanistic-interpretability claims cannot be grounded. When causal abstraction is generalised by dropping the linearity constraint on the alignment map φ and allowing arbitrary non-linear functions, it can be shown theoretically that any DNN can be made to perfectly match any algorithm, making it impossible to identify "which algorithm the model actually implements." A non-linear φ of limited complexity can still provide meaningful abstractions; the framework only becomes vacuous when "any φ" is allowed. For a non-linear φ to support interpretability, one must therefore specify which non-linear family V it is drawn from, and why there are grounds to trust that family.
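A minimal toy sketch of the collapse, under illustrative assumptions (a hypothetical injective hidden map and an arbitrary algorithm-level variable; this is not the paper's actual construction): when φ may be any function at all, a lookup table built from the hidden states "decodes" whatever algorithm variable we like perfectly, whereas a linearly constrained φ is a genuine hypothesis that can fail.

```python
# Toy illustration only: with an UNCONSTRAINED alignment map phi, any injective
# hidden representation can be mapped perfectly onto any algorithm-level variable,
# so "the network abstracts to the algorithm" carries no information by itself.
# A linearly constrained phi, by contrast, can genuinely fail to fit.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 2))            # fixed random weights for a toy "DNN" layer

# Toy inputs: pairs of small integers.
inputs = [(a, b) for a in range(4) for b in range(4)]

def hidden(a, b):
    """Deterministic, non-linear, (practically) injective hidden state for input (a, b)."""
    return np.tanh(W @ np.array([a, b], dtype=float) + 1.0)

def algo_variable(a, b):
    """A hypothetical algorithm-level variable we claim the network 'encodes'."""
    return (a * b) % 3

# Unconstrained phi: simply memorise the hidden-state -> algorithm-variable pairing.
phi = {hidden(a, b).tobytes(): algo_variable(a, b) for a, b in inputs}
perfect = all(phi[hidden(a, b).tobytes()] == algo_variable(a, b) for a, b in inputs)
print("unconstrained phi matches the algorithm everywhere:", perfect)  # True, trivially

# Constrained phi (linear probe): least-squares fit from hidden states to the variable.
H = np.stack([hidden(a, b) for a, b in inputs])
y = np.array([algo_variable(a, b) for a, b in inputs], dtype=float)
w, *_ = np.linalg.lstsq(H, y, rcond=None)
print("linear phi mean squared error:", float(np.mean((H @ w - y) ** 2)))  # typically well above 0
```

The point of the sketch is only that a perfect match under an unrestricted φ comes for free; it is the choice of a restricted family (linear, or a justified low-complexity non-linear V) that carries the interpretive content.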