Non‑linear Representation Dilemma

The dilemma: if we allow the assumption that representations can be encoded non-linearly, causal abstraction by itself becomes vacuous as an interpretability tool, and without additional assumptions such as the Linear Representation Hypothesis (which enforces a linear alignment), mechanistic-interpretability claims cannot be grounded. When causal abstraction is generalised by dropping the linearity constraint on the alignment map φ and allowing arbitrary non-linear functions, it can be shown theoretically that any DNN can be made to perfectly match any algorithm, making it impossible to identify "which algorithm the model actually implements." A non-linear φ of limited complexity can still provide meaningful abstractions; the framework only becomes vacuous when "any φ" is allowed. For a non-linear φ to support interpretability, one must therefore specify which non-linear family V it is drawn from, and why there are grounds to trust that family.
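A minimal toy sketch of the collapse, under illustrative assumptions (a hypothetical injective hidden map and an arbitrary algorithm-level variable; this is not the paper's actual construction): when φ may be any function at all, a lookup table built from the hidden states "decodes" whatever algorithm variable we like perfectly, whereas a linearly constrained φ is a genuine hypothesis that can fail.

```python
# Toy illustration only: with an UNCONSTRAINED alignment map phi, any injective
# hidden representation can be mapped perfectly onto any algorithm-level variable,
# so "the network abstracts to the algorithm" carries no information by itself.
# A linearly constrained phi, by contrast, can genuinely fail to fit.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 2))            # fixed random weights for a toy "DNN" layer

# Toy inputs: pairs of small integers.
inputs = [(a, b) for a in range(4) for b in range(4)]

def hidden(a, b):
    """Deterministic, non-linear, (practically) injective hidden state for input (a, b)."""
    return np.tanh(W @ np.array([a, b], dtype=float) + 1.0)

def algo_variable(a, b):
    """A hypothetical algorithm-level variable we claim the network 'encodes'."""
    return (a * b) % 3

# Unconstrained phi: simply memorise the hidden-state -> algorithm-variable pairing.
phi = {hidden(a, b).tobytes(): algo_variable(a, b) for a, b in inputs}
perfect = all(phi[hidden(a, b).tobytes()] == algo_variable(a, b) for a, b in inputs)
print("unconstrained phi matches the algorithm everywhere:", perfect)  # True, trivially

# Constrained phi (linear probe): least-squares fit from hidden states to the variable.
H = np.stack([hidden(a, b) for a, b in inputs])
y = np.array([algo_variable(a, b) for a, b in inputs], dtype=float)
w, *_ = np.linalg.lstsq(H, y, rcond=None)
print("linear phi mean squared error:", float(np.mean((H @ w - y) ** 2)))  # typically well above 0
```

The point of the sketch is only that a perfect match under an unrestricted φ comes for free; it is the choice of a restricted family (linear, or a justified low-complexity non-linear V) that carries the interpretive content.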