Linear Representation Hypothesis

Creator: Seonglae Cho
Created: 2024 May 24 4:19
Edited: 2025 Oct 29 12:30

LRH

The Linear Representation Hypothesis (LRH) holds that neural networks represent features as directions in activation space. There is significant empirical evidence for such interpretable linear directions.
Towards interpretable gpt2
LRH Notion
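To make "a linear direction in activation space" concrete, here is a minimal sketch on synthetic data. It is not taken from any of the linked works: the activations are Gaussian noise, the concept direction `true_direction` and the shift of 3.0 are assumptions chosen for illustration, and the difference-of-means probe is just one common way to estimate such a direction.

```python
import random

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_activations(n, direction, concept_present, rng):
    # Synthetic "activations": unit Gaussian noise, shifted along `direction`
    # when the (hypothetical) concept is present.
    shift = 3.0 if concept_present else 0.0
    return [[rng.gauss(0.0, 1.0) + shift * d for d in direction] for _ in range(n)]

rng = random.Random(0)
dim = 16
true_direction = [1.0 if i == 0 else 0.0 for i in range(dim)]  # assumed ground truth

pos = make_activations(200, true_direction, True, rng)
neg = make_activations(200, true_direction, False, rng)

# Difference-of-means estimate of the concept direction, normalized to unit length.
mu_pos, mu_neg = mean(pos), mean(neg)
diff = [p - q for p, q in zip(mu_pos, mu_neg)]
norm = dot(diff, diff) ** 0.5
probe = [x / norm for x in diff]

# Classify by projecting onto the probe and thresholding halfway between class means.
threshold = (dot(mu_pos, probe) + dot(mu_neg, probe)) / 2

def predict(activation):
    return dot(activation, probe) > threshold

correct = sum(predict(a) for a in pos) + sum(not predict(a) for a in neg)
accuracy = correct / (len(pos) + len(neg))
cosine = dot(probe, true_direction)  # both vectors have unit length

print(f"cosine with true direction: {cosine:.3f}, accuracy: {accuracy:.3f}")
```

If the concept really is encoded as a direction, the estimated probe should align closely with it and a single dot product should separate the two classes well. On real model activations the same recipe applies, but the result is evidence about linearity, not a guarantee.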
 
 
 

2013

2022
Residual Stream

Word Embedding

2023
Residual Stream
linearity evidence

ICML 2024 workshop
ICLR 2025, Kiho Park
Gemma with simplex representation space
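A multidimensional feature, meaning one that lives in a subspace of more than one dimension, is not by itself sufficient evidence of a non-linear representation: a feature can occupy a multi-dimensional subspace and still be read out linearly.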
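A small sketch may help show why dimensionality alone does not settle the question. It assumes a hypothetical seven-valued circular feature placed in a two-dimensional subspace spanned by `u` and `v`, plus an unrelated feature along `w`; all names and values are illustrative assumptions. The feature needs two dimensions, yet a linear projection onto that subspace recovers it exactly, independent of the other feature.

```python
import math

dim = 8

def unit(i):
    """Standard basis vector e_i in `dim` dimensions."""
    return [1.0 if j == i else 0.0 for j in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical orthonormal directions: (u, v) span the feature's subspace,
# w carries an unrelated feature.
u, v, w = unit(0), unit(1), unit(2)

def embed(value, other_strength):
    # Place `value` (0..6) on a unit circle inside span(u, v) and add an
    # unrelated component along w.
    angle = 2 * math.pi * value / 7
    return [
        math.cos(angle) * u[j] + math.sin(angle) * v[j] + other_strength * w[j]
        for j in range(dim)
    ]

def decode(activation):
    # The only access to the activation is a linear projection onto (u, v);
    # the angle is then read from the projected 2D point.
    x, y = dot(activation, u), dot(activation, v)
    angle = math.atan2(y, x) % (2 * math.pi)
    return round(angle * 7 / (2 * math.pi)) % 7

decoded = [decode(embed(value, other_strength=5.0)) for value in range(7)]
print(decoded)
```

Whether one calls such a feature "linear" depends on the definition in use. The sketch only shows that occupying more than one dimension is compatible with a linear subspace readout.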
 
 
