Linear Representation Hypothesis (LRH)
There is significant empirical evidence that neural networks represent interpretable features as linear directions in activation space.
LRH Notion
Towards interpretable gpt2
2013
2022 Residual Stream
Word Embedding
2023 Residual Stream linearity evidence
The existence of multidimensional features, which live in subspaces of more than one dimension, is not by itself sufficient to justify non-linear representations.
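A minimal sketch of why a feature spanning more than one dimension can still be a linear representation. Everything here is a synthetic assumption made for illustration: a hypothetical cyclic feature with 7 values is placed on a circle inside a 2-D subspace of a 32-D "activation" space, and a plain linear map (least squares) reads it back out. No real model is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_values, n_samples = 32, 7, 700  # illustrative sizes, not from any model

# Hypothetical 2-D feature subspace: two orthonormal directions in activation space.
basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))  # shape (dim, 2)

# A cyclic feature (e.g. a position on a 7-step cycle) encoded as a point on a circle.
values = rng.integers(0, n_values, size=n_samples)
angles = 2 * np.pi * values / n_values
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (n_samples, 2)

# Synthetic activations: the circle embedded in the subspace, plus small noise.
acts = circle @ basis.T + 0.05 * rng.normal(size=(n_samples, dim))

# Linear readout fitted by least squares: activations -> circle coordinates.
readout, *_ = np.linalg.lstsq(acts, circle, rcond=None)  # shape (dim, 2)
decoded = acts @ readout

# Convert the decoded 2-D point back to the nearest of the 7 discrete values.
decoded_angles = np.arctan2(decoded[:, 1], decoded[:, 0])
decoded_values = np.round(decoded_angles / (2 * np.pi) * n_values).astype(int) % n_values
accuracy = (decoded_values == values).mean()
print(f"linear readout accuracy: {accuracy:.3f}")
```

The feature needs two dimensions, yet the map from activations to the feature is a single matrix multiplication, so the representation is still linear in the sense of the hypothesis.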


COLM 2024 – The Geometry of Truth
LLMs represent factuality (True/False) linearly in their internal representations. Larger models exhibit a more distinct and generalizable 'truth direction'.
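One simple way to look for such a direction is a difference-of-means probe: subtract the mean activation of false statements from the mean activation of true ones and classify by projecting onto the result. The sketch below runs on synthetic data only; the planted direction, the dimensions, and the helper names `fit_mean_difference_probe` and `predict` are assumptions for illustration, not the paper's code or results.

```python
import numpy as np

def fit_mean_difference_probe(acts, labels):
    """Return (unit direction, threshold) from class means of the activations."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)
    mean_true = acts[labels].mean(axis=0)
    mean_false = acts[~labels].mean(axis=0)
    direction = mean_true - mean_false
    direction /= np.linalg.norm(direction)
    # Threshold at the midpoint between the two projected class means.
    threshold = 0.5 * (mean_true + mean_false) @ direction
    return direction, threshold

def predict(acts, direction, threshold):
    """Label an activation as true when its projection exceeds the threshold."""
    return (np.asarray(acts, dtype=float) @ direction) > threshold

# Synthetic stand-in for residual-stream activations with a planted direction.
rng = np.random.default_rng(0)
dim, n = 64, 500
planted = rng.normal(size=dim)
planted /= np.linalg.norm(planted)
labels = rng.integers(0, 2, size=n)
signs = 2.0 * labels - 1.0  # +1 for true, -1 for false
acts = rng.normal(size=(n, dim)) + 2.0 * np.outer(signs, planted)

# Fit on the first 400 samples, evaluate on the held-out 100.
direction, threshold = fit_mean_difference_probe(acts[:400], labels[:400])
held_out_pred = predict(acts[400:], direction, threshold)
held_out_acc = (held_out_pred == labels[400:].astype(bool)).mean()
cosine = float(direction @ planted)
print(f"cosine with planted direction: {cosine:.3f}, held-out accuracy: {held_out_acc:.3f}")
```

With a real model, `acts` would be replaced by activations collected at a chosen layer for labeled true and false statements; whether the recovered direction generalizes across datasets is an empirical question.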

Seonglae Cho