Linear Representation Hypothesis

Creator: Seonglae Cho
Created: 2024 May 24 4:19
Edited: 2025 Nov 21 23:0

LRH

There is significant empirical evidence that neural networks represent interpretable concepts as linear directions in activation space.
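As a toy illustration of what a "linear direction" means here, the sketch below builds synthetic activations (all dimensions and noise scales are made up for illustration) in which a binary concept is encoded along a single direction, then recovers that direction with a simple difference-of-means probe:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64            # hypothetical activation dimensionality
n = 200           # samples per class

# Assume a concept (e.g. sentiment) is encoded along one direction w_true.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)

labels = np.array([0] * n + [1] * n)
noise = rng.normal(scale=0.5, size=(2 * n, d))
# Activations: the concept adds +w_true or -w_true on top of noise.
acts = noise + np.where(labels[:, None] == 1, 1.0, -1.0) * w_true

# A difference-of-means probe recovers the concept direction.
w_hat = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)
w_hat /= np.linalg.norm(w_hat)

print("cosine(w_true, w_hat) =", float(w_true @ w_hat))  # close to 1
preds = (acts @ w_hat > 0).astype(int)
print("probe accuracy:", (preds == labels).mean())
```

If the concept really is linearly represented, a single inner product with `w_hat` separates the two classes; this is the basic operation behind linear probing and activation steering.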
LRH Notion
Towards interpretable gpt2

2013: Word Embedding (word2vec linear analogies)
2022: Residual Stream
2023: Residual Stream linearity evidence
ICML 2024 workshop
ICLR 2025, Kiho Park: Gemma with simplex representation space
A multidimensional feature that lives in a subspace of more than one dimension is not by itself sufficient to justify non-linear representations; such a feature can still occupy a linear subspace.
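A minimal sketch of this point, on synthetic data: a circular feature (of the kind hypothesized for cyclic concepts such as day-of-week) is genuinely two-dimensional, yet it lies inside a 2D *linear* subspace of activation space, which plain SVD recovers:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
# Hypothetical circular feature: points on a circle embedded in a
# 2-dimensional linear subspace of d-dimensional activation space.
angles = rng.uniform(0, 2 * np.pi, size=500)
basis, _ = np.linalg.qr(rng.normal(size=(d, 2)))  # orthonormal 2D basis
acts = (np.cos(angles)[:, None] * basis[:, 0]
        + np.sin(angles)[:, None] * basis[:, 1])
acts += rng.normal(scale=0.05, size=acts.shape)   # small noise

# SVD of the centered activations: the first two singular values dominate,
# i.e. the multidimensional feature still lives in a linear subspace.
_, s, _ = np.linalg.svd(acts - acts.mean(0), full_matrices=False)
print(s[:4])  # first two values dominate the rest
```

The feature is not one-dimensional, but it is still linear in the relevant sense: a fixed linear projection captures it, so no non-linear representation is needed to explain it.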
COLM 2024 – The Geometry of Truth
LLMs represent factuality (True/False) linearly in their internal representations. Larger models exhibit a more distinct and generalizable 'truth direction'.
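The generalization claim can be sketched with synthetic activations (the two "topics" and the shared truth direction are assumptions of this toy model, not the paper's data): a mass-mean probe fitted on statements from one topic transfers to statements from another, as expected if truth is encoded along one shared linear direction:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
# Assumed shared "truth direction" (a modeling assumption for this sketch).
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

def synth_acts(n, topic_offset):
    """Synthetic activations: topic offset + truth signal + noise."""
    labels = rng.integers(0, 2, size=n)
    acts = (topic_offset
            + np.where(labels[:, None] == 1, 1.0, -1.0) * truth_dir
            + rng.normal(scale=0.4, size=(n, d)))
    return acts, labels

topic_a, topic_b = rng.normal(size=d), rng.normal(size=d)
Xa, ya = synth_acts(300, topic_a)  # e.g. "city locations"
Xb, yb = synth_acts(300, topic_b)  # e.g. "arithmetic facts"

# Mass-mean probe fitted on topic A alone.
w = Xa[ya == 1].mean(0) - Xa[ya == 0].mean(0)

# It transfers to topic B once the topic offset is centered out,
# consistent with a single shared linear truth direction.
scores = (Xb - Xb.mean(0)) @ w
acc = ((scores > 0).astype(int) == yb).mean()
print("cross-topic probe accuracy:", acc)
```

If factuality were encoded differently per topic, a direction learned on one topic would not separate true from false statements on the other.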