Linear representation hypothesis

Creator: Seonglae Cho
Created: 2024 May 24 4:19
Edited: 2025 Mar 10 14:20
There is significant empirical evidence that neural networks represent features as interpretable linear directions in activation space.
Towards interpretable GPT-2
- 2013: word2vec embeddings support linear analogy arithmetic (e.g. king - man + woman ≈ queen; see the sketch below).
- 2022: the Residual Stream view of transformers treats features as directions that components read from and write to a shared linear space.
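The 2013 analogy result can be reproduced directly; a minimal sketch using gensim's pretrained word2vec vectors (the model downloads on first use):

```python
import gensim.downloader as api

# Pretrained word2vec vectors (Mikolov et al. 2013).
model = api.load("word2vec-google-news-300")

# Linear analogy: king - man + woman ~= queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> [('queen', 0.71...)]
```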

- NeurIPS 2023, Kiho Park et al., "The Linear Representation Hypothesis and the Geometry of Large Language Models": formalizes the hypothesis in the word-embedding (unembedding) space of LLMs.

- Euclidean inner product: vectors placed randomly in high-dimensional space are almost orthogonal to one another, so raw Euclidean angles need not reflect semantic relationships.
- Causal inner product: redefines the inner product so that causally separated (independent) concepts are orthogonal to each other.
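A minimal numpy sketch of both points. The form <x, y>_C = x^T Sigma^{-1} y, with Sigma the covariance of the unembedding vectors, follows Park et al.'s construction, but Sigma here is estimated from random stand-in vectors rather than a real unembedding matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 2000                         # representation dim, number of vectors
A = rng.normal(size=(d, d)) / np.sqrt(d)
G = rng.normal(size=(n, d)) @ A          # anisotropic stand-ins for unembedding vectors

# 1) Random high-dimensional directions are nearly orthogonal on average,
#    so Euclidean angles alone say little about semantic relatedness.
U = G / np.linalg.norm(G, axis=1, keepdims=True)
cos = U @ U.T
off_diag = cos[~np.eye(n, dtype=bool)]
print(f"mean |cosine| between random vectors: {np.abs(off_diag).mean():.3f}")

# 2) Causal inner product: <x, y>_C = x^T Sigma^{-1} y. Equivalently, the
#    Euclidean inner product after whitening the space by Sigma^{-1/2}.
Sigma = np.cov(G, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma)

def causal_inner(x, y):
    return x @ Sigma_inv @ y

x, y = G[0], G[1]
print(f"Euclidean: {x @ y:+.3f}   causal: {causal_inner(x, y):+.3f}")
```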

- 2023: linearity evidence in the Residual Stream (see the probe sketch below).
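Such linearity evidence is typically gathered with linear probes: if a concept is linearly represented, a linear classifier on residual-stream activations recovers it, and its weight vector aligns with the concept direction. A minimal sketch on synthetic activations, where w_true, X, and the labels are all stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 256, 2000
w_true = rng.normal(size=d)           # hypothetical ground-truth concept direction
X = rng.normal(size=(n, d))           # stand-in residual-stream activations
y = (X @ w_true > 0).astype(int)      # labels carried by one linear direction

probe = LogisticRegression(max_iter=1000).fit(X[:1500], y[:1500])
print("held-out probe accuracy:", probe.score(X[1500:], y[1500:]))

# If the feature is linear, the probe weights align with the true direction.
w_hat = probe.coef_.ravel()
cos = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true))
print("cosine(probe, true direction):", round(float(cos), 3))
```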

- ICML 2024 workshop, ICLR 2025, Kiho Park et al.: in Gemma, categorical concepts are represented as simplices in representation space.
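A minimal sketch of the simplex picture, with random stand-in vectors for the values of one categorical concept (real vectors would come from a model's unembedding matrix, following Park et al.):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4                   # representation dim, number of category values
V = rng.normal(size=(k, d))    # stand-in vectors for the k values of one concept

# The categorical concept is represented by the polytope spanned by its
# value vectors. The k vertices form a (k-1)-simplex exactly when they are
# affinely independent, i.e. the k-1 edge vectors from one vertex are
# linearly independent.
edges = V[1:] - V[0]
rank = np.linalg.matrix_rank(edges)
print(f"edge rank: {rank} -> {'a' if rank == k - 1 else 'not a'} {k - 1}-simplex")
```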
A multidimensional feature that lives in a subspace of more than one dimension is not, by itself, sufficient to justify non-linear representations, since such a subspace is still spanned by a set of linear directions.