LRH
There is substantial empirical evidence that neural networks represent many interpretable features as linear directions in activation space.
LRH Notion
2013 Efficient Estimation of Word Representations in Vector Space (Tomas Mikolov, Google)
Tomas Mikolov, Microsoft Research (Vector Composition: King − Man + Woman ≈ Queen)
Linguistic Regularities in Continuous Space Word Representations
Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013.
https://aclanthology.org/N13-1090/
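As a quick sanity check of this analogy arithmetic, a minimal sketch using gensim's pretrained word2vec vectors; the specific model name (and the download it triggers) is an assumption, and any KeyedVectors model would work the same way:

```python
# Word-vector analogy arithmetic: King - Man + Woman ≈ Queen.
# Assumes gensim is installed; loading 'word2vec-google-news-300' downloads
# the pretrained vectors (~1.6 GB) on first use.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # a KeyedVectors instance

# most_similar() adds the 'positive' vectors, subtracts the 'negative' ones,
# and ranks vocabulary words by cosine similarity to the resulting vector.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is typically the top match.
```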
2022 Residual Stream
Toy Models of Superposition
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?
https://transformer-circuits.pub/2022/toy_model/index.html
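The qualitative point can be illustrated without training anything: give each of many sparse features its own random, nearly-orthogonal direction in a smaller space. A minimal numpy sketch, where random directions stand in for the paper's learned embedding matrix, so this only illustrates superposition rather than reproducing the toy model:

```python
# Superposition sketch: 400 sparse features stored in a 100-dimensional space.
# Random unit directions stand in for a learned feature embedding.
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 400, 100

# One direction per feature; with n_features > d_model they cannot all be
# orthogonal, so reading one feature picks up interference from the others.
W = rng.normal(size=(d_model, n_features))
W /= np.linalg.norm(W, axis=0)

# Sparse input: only a handful of features are active at once.
x = np.zeros(n_features)
active = rng.choice(n_features, size=3, replace=False)
x[active] = 1.0

h = W @ x            # superposed representation (the "activations")
readout = W.T @ h    # dot each feature direction against the activations

print("readout on active features :", np.round(readout[active], 2))
print("mean |readout| elsewhere   :", np.round(np.abs(np.delete(readout, active)).mean(), 2))
# No single dimension of h corresponds to one feature, yet each feature is still
# approximately readable along its own linear direction; in the paper's toy model
# a ReLU with a negative bias is what suppresses the residual interference.
```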
Word Embedding
The Linear Representation Hypothesis and the Geometry of Large Language Models
Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely...
https://openreview.net/forum?id=T0PoOJg8cK&noteId=hIzVPo3wws
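One simple way to operationalize "concept as a direction" is to average embedding differences over counterfactual word pairs and check that the direction transfers to a held-out pair. A rough sketch in that spirit, reusing the word2vec vectors from the earlier snippet rather than the LLM embedding/unembedding spaces the paper actually studies; the word lists are illustrative choices:

```python
# Estimate a 'male -> female' direction from counterfactual word pairs and
# test whether it transfers to a held-out pair.
import numpy as np
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

pairs = [("man", "woman"), ("king", "queen"), ("father", "mother"), ("actor", "actress")]
direction = np.mean([vectors[b] - vectors[a] for a, b in pairs], axis=0)
direction /= np.linalg.norm(direction)

# A held-out counterfactual pair should be separated along the same direction.
held_out = vectors["aunt"] - vectors["uncle"]
held_out /= np.linalg.norm(held_out)
print(float(held_out @ direction))  # clearly positive cosine similarity expected
```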
2023 Residual Stream linearity evidence
The Geometry of Categorical and Hierarchical Concepts in Large...
Understanding how semantic meaning is encoded in the representation spaces of large language models is a fundamental problem in interpretability. In this paper, we study the two foundational...
https://openreview.net/forum?id=KXuYjuBzKo
A multidimensional feature, i.e. one that lives in a subspace of more than one dimension, is not by itself sufficient to justify non-linear representations, since such a feature still occupies a linear subspace. See the sketch below.
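A synthetic illustration of that point, assuming nothing beyond numpy: a circular (two-dimensional) feature embedded in a high-dimensional space still occupies a fixed two-dimensional linear subspace, which SVD recovers.

```python
# A circular 2-D feature (e.g. a periodic quantity like day of week) embedded
# in a 768-dimensional space; 768 is an arbitrary stand-in for a hidden size.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_points = 768, 500

angles = rng.uniform(0, 2 * np.pi, n_points)
feature_2d = np.stack([np.cos(angles), np.sin(angles)], axis=1)

basis = np.linalg.qr(rng.normal(size=(d_model, 2)))[0]  # random orthonormal 2-D subspace
acts = feature_2d @ basis.T + 0.01 * rng.normal(size=(n_points, d_model))  # small noise

# Singular values of the (centered) activations: essentially all variance lies
# in two directions, i.e. the multidimensional feature is still a linear subspace.
s = np.linalg.svd(acts - acts.mean(0), compute_uv=False)
print(np.round(s[:4], 2))  # two large values, then a sharp drop
```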


Circuits Updates - July 2024
We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
https://transformer-circuits.pub/2024/july-update/index.html#linear-representations
COLM 2024 – The Geometry of Truth
LLMs represent factuality (True/False) linearly in their internal representations. Larger models exhibit a more distinct and generalizable 'truth direction'.
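A minimal sketch of a difference-of-means ("mass-mean") style truth probe in the spirit of the paper, run here on synthetic stand-in activations rather than real residual-stream activations of true/false statements; the dimensions and offsets are arbitrary assumptions:

```python
# Difference-of-means truth probe on synthetic activations.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Synthetic activations: true and false statements offset along a hidden direction.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
acts_true  = rng.normal(size=(200, d_model)) + 2.0 * true_dir
acts_false = rng.normal(size=(200, d_model)) - 2.0 * true_dir

# The probe direction is simply the difference of class means.
theta = acts_true.mean(0) - acts_false.mean(0)
midpoint = (acts_true.mean(0) + acts_false.mean(0)) / 2

# Classify held-out activations by which side of the midpoint they project to.
test = rng.normal(size=(100, d_model)) + 2.0 * true_dir  # held-out "true" statements
preds = (test - midpoint) @ theta > 0
print("accuracy on held-out true statements:", preds.mean())  # expected close to 1.0
```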

Seonglae Cho