There is substantial empirical evidence that neural networks represent interpretable features as linear directions in activation space (the linear representation hypothesis).
Towards interpretable GPT-2
2013: word2vec word embeddings (Mikolov et al., 2013) already show this linear structure, with analogies such as vec("king") - vec("man") + vec("woman") ≈ vec("queen") falling out of plain vector arithmetic.
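A minimal sketch of that analogy arithmetic. The embeddings here are hypothetical toy vectors, not trained word2vec weights; with real trained vectors the same computation recovers "queen".

```python
import numpy as np

# Hypothetical toy embeddings, chosen so the analogy works out exactly.
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "apple": np.array([0.1, 0.2, 0.3]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Linear analogy: king - man + woman should land nearest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # -> "queen" with these toy vectors
```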
2022: evidence that features in the transformer residual stream are likewise encoded as linear directions.
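One common way such a residual-stream direction is extracted is the difference of class means (sometimes called a mass-mean or steering direction). A sketch on synthetic activations, since no real model is loaded here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)

# Synthetic stand-ins for residual-stream activations: the "positive"
# prompts carry the concept direction on top of shared Gaussian noise.
acts_pos = rng.normal(size=(1000, d_model)) + 3.0 * true_dir
acts_neg = rng.normal(size=(1000, d_model))

# Candidate feature direction: the difference of the two class means.
feature_dir = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
feature_dir /= np.linalg.norm(feature_dir)

# Cosine similarity with the planted direction is ~0.95 here:
# the linear direction is recovered from means alone.
print(float(feature_dir @ true_dir))
```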
NIPS 2023 Kiho Park Word Embedding
Euclidean inner product
Vectors placed at random in high-dimensional space are nearly orthogonal to one another, so Euclidean angles need not reflect semantic relationships between features.
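A quick numerical check of the near-orthogonality claim; the shrinkage of the typical cosine similarity like 1/sqrt(d) is standard concentration-of-measure behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (3, 100, 10_000):
    # 500 random unit vectors in dimension d.
    v = rng.normal(size=(500, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    # All pairwise cosine similarities (upper triangle, no diagonal).
    cos = (v @ v.T)[np.triu_indices(500, k=1)]
    # Mean |cosine| shrinks roughly like 1/sqrt(d): ~0.46, ~0.08, ~0.008.
    print(d, float(np.abs(cos).mean()))
```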
Causal inner product
Redefine the inner product so that causally separated (independent) features are orthogonal to each other.
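A hedged sketch of one such inner product, <x, y>_C = x^T A y, assuming the estimator suggested by Park et al.: A is the inverse covariance of the unembedding vectors. The unembedding matrix below is a random stand-in, not weights from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 5000
# Random stand-in for a model's unembedding matrix (vocab x d_model),
# given correlated coordinates via a random linear map.
gamma = rng.normal(size=(vocab, d_model)) @ rng.normal(size=(d_model, d_model))

# Metric for the causal inner product: A ~ Cov(gamma)^{-1}.
A = np.linalg.inv(np.cov(gamma, rowvar=False))

def causal_inner(x, y):
    # Inner product in which causally separated concept directions
    # are (approximately) orthogonal, per the paper's argument.
    return x @ A @ y

x, y = rng.normal(size=d_model), rng.normal(size=d_model)
print(causal_inner(x, y))
```

Equivalently, one can whiten all vectors by Cov(gamma)^{-1/2} once and then use the ordinary Euclidean inner product in the transformed space.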
2023: further evidence of linear feature directions in the residual stream.
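Much of this evidence takes the form of linear probes. A minimal probe sketch, with synthetic activations standing in for residual-stream activations cached from a real model; `LogisticRegression` is scikit-learn's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 256
concept_dir = rng.normal(size=d_model)

# Synthetic "activations" in which the concept is linearly encoded.
X = rng.normal(size=(2000, d_model))
y = (X @ concept_dir > 0).astype(int)

# A linear probe predicts the concept from activations alone.
probe = LogisticRegression(max_iter=1000).fit(X[:1500], y[:1500])
# Held-out accuracy is near 1.0 when the encoding really is linear.
print(probe.score(X[1500:], y[1500:]))
```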
A multidimensional feature, one that lives in a subspace of more than one dimension, is not by itself sufficient to justify non-linear representations: a low-dimensional subspace is still a linear object, spanned by a handful of directions.
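For instance, a circular day-of-week feature (cf. Engels et al., "Not All Language Model Features Are Linear") occupies a 2-D plane. The sketch below uses synthetic activations with such a planted circle and shows the plane is recoverable with plain linear algebra, which is the sense in which it remains linear.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 128
# A random 2-D plane inside the model's activation space.
basis = np.linalg.qr(rng.normal(size=(d_model, 2)))[0]

# Seven points on a circle, one per weekday, embedded in that plane.
angles = 2 * np.pi * np.arange(7) / 7
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
acts = circle @ basis.T + 0.01 * rng.normal(size=(7, d_model))

# PCA/SVD recovers the plane: two dominant singular values, rest ~0.
s = np.linalg.svd(acts - acts.mean(axis=0), compute_uv=False)
print(np.round(s[:4], 3))  # e.g. [1.87, 1.87, ~0.1, ~0.1]
```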