Linear Representation Hypothesis

Created: 2024 May 24 4:19
Creator: Seonglae Cho
Edited: 2025 Dec 13 18:03

LRH

There is significant empirical evidence suggesting that neural networks have interpretable linear directions in activation space.
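A minimal sketch of how such a direction is often extracted in practice: collect activations from inputs that do and do not express a concept, and take the difference of the class means as the concept direction. The activations below are synthetic stand-ins, not real model activations.

```python
import numpy as np

# Synthetic stand-in for model activations (d_model-dim vectors) collected from
# prompts that do / do not express some concept. A planted direction plays the
# role of the concept's linear representation.
rng = np.random.default_rng(0)
d_model = 512
planted_dir = rng.normal(size=d_model)
acts_pos = rng.normal(size=(200, d_model)) + 2.0 * planted_dir   # concept present
acts_neg = rng.normal(size=(200, d_model))                       # concept absent

# Difference-of-means estimate of the concept direction.
concept_dir = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
concept_dir /= np.linalg.norm(concept_dir)

# Under the LRH, projecting activations onto this single direction separates the classes.
print(float((acts_pos @ concept_dir).mean()), float((acts_neg @ concept_dir).mean()))
```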
LRH Notion
 
 
 

2013 Efficient Estimation of Word Representations in Vector Space (Tomas Mikolov, Google)

Tomas Mikolov, Microsoft (Vector Composition: King − Man + Woman ≈ Queen)
Linguistic Regularities in Continuous Space Word Representations
Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013.
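The classic demonstration is analogy arithmetic on word vectors. A minimal sketch, assuming gensim and the pretrained Google News word2vec vectors are available:

```python
# Assumes gensim is installed; the pretrained vectors (~1.6 GB) download on first use.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen: a linear offset encodes the gender relation.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```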

2022
Residual Stream

Toy Models of Superposition
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?
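A minimal sketch of the paper's toy setup, with illustrative (not the paper's) dimensions and hyperparameters: n sparse features are compressed into a smaller hidden space by W and reconstructed as ReLU(WᵀWx + b). When features are sparse enough, the learned columns of W pack more features than there are hidden dimensions (superposition), so individual neurons stop aligning with single features.

```python
import torch

n_feat, d_hidden, sparsity = 20, 5, 0.05        # illustrative sizes
W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_feat))
b = torch.nn.Parameter(torch.zeros(n_feat))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(3000):
    # Sparse feature vectors: each feature is active with probability `sparsity`.
    x = torch.rand(1024, n_feat) * (torch.rand(1024, n_feat) < sparsity)
    x_hat = torch.relu(x @ W.T @ W + b)          # compress to d_hidden, then reconstruct
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Columns of W are the learned feature directions. With sparse features they are
# typically non-orthogonal: more features are represented than there are neurons.
print(torch.round((W.T @ W).detach(), decimals=2))
```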

Word Embedding

The Linear Representation Hypothesis and the Geometry of Large...
Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely...

2023
Residual Stream linearity evidence

ICML 2024 workshop, ICLR 2025 (Kiho Park): categorical concepts are represented as a simplex in Gemma's representation space.
The Geometry of Categorical and Hierarchical Concepts in Large...
Understanding how semantic meaning is encoded in the representation spaces of large language models is a fundamental problem in interpretability. In this paper, we study the two foundational...
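An illustrative numpy sketch of the claimed geometry (toy vectors, not the paper's estimator): a k-way categorical concept corresponds to the vertices of a simplex, and differences between sibling categories are orthogonal to the parent concept's direction.

```python
import numpy as np

# Toy vectors for a 3-way categorical concept (say {English, French, Spanish}):
# the three directions form a regular 2-simplex (equilateral triangle).
vertices = np.array([[1.0, 0.0],
                     [-0.5,  np.sqrt(3) / 2],
                     [-0.5, -np.sqrt(3) / 2]])
print(np.round(vertices @ vertices.T, 2))        # pairwise cosines are all -0.5 (120°)

# Hierarchy: give the parent concept ("language") a direction along a third axis
# that every child shares, so between-sibling differences are orthogonal to it.
parent = np.array([0.0, 0.0, 1.0])
children = np.hstack([vertices, np.ones((3, 1))])
diffs = children[1:] - children[0]               # between-sibling difference vectors
print(np.round(diffs @ parent, 2))               # ≈ 0: siblings differ orthogonally to parent
```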
A multidimensional feature that lives in a subspace of more than one dimension is not by itself sufficient to justify non-linear representations.
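For example, a circular "day of week" feature occupies a 2-D subspace yet is still a linear combination of two fixed directions with input-dependent coefficients. A toy illustration with hypothetical vectors:

```python
import numpy as np

# Hypothetical circular feature: days of the week laid out on a circle
# inside a 2-D subspace of a 4-D activation space.
angles = 2 * np.pi * np.arange(7) / 7
u = np.array([1.0, 0.0, 0.0, 0.0])        # two fixed, orthogonal directions
v = np.array([0.0, 1.0, 0.0, 0.0])        # spanning the feature's 2-D subspace

# Each day's representation is a linear combination of u and v:
# the feature is multidimensional, but still built from fixed directions.
days = np.outer(np.cos(angles), u) + np.outer(np.sin(angles), v)
print(np.round(days @ np.vstack([u, v]).T, 2))   # projection recovers (cos, sin) coordinates
```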
Circuits Updates - July 2024
We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
COLM 2024 – The Geometry of Truth
LLMs represent factuality (True/False) linearly in their internal representations. Larger models exhibit a more distinct and generalizable 'truth direction'.
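A minimal sketch of how such a truth direction is typically probed for: fit a linear classifier on hidden states of true vs. false statements and take its weight vector as the direction. The hidden states below are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for LLM hidden states of true / false statements.
rng = np.random.default_rng(0)
d_model = 256
truth_dir = rng.normal(size=d_model)
h_true = rng.normal(size=(500, d_model)) + truth_dir
h_false = rng.normal(size=(500, d_model)) - truth_dir

X = np.vstack([h_true, h_false])
y = np.array([1] * 500 + [0] * 500)

# A linear probe; its weight vector is an estimate of the "truth direction".
probe = LogisticRegression(max_iter=1000).fit(X, y)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("probe accuracy:", probe.score(X, y))
print("cosine with planted direction:",
      float(direction @ truth_dir / np.linalg.norm(truth_dir)))
```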
 
 
