LRH
There is substantial empirical evidence that neural networks represent many interpretable features as linear directions in activation space.
LRH Notion
2013 Efficient Estimation of Word Representations in Vector Space (Tomas Mikolov, Google)
Tomas Mikolov, Microsoft Research (Vector Composition: King − Man + Woman ≈ Queen)
Linguistic Regularities in Continuous Space Word Representations
Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013.
https://aclanthology.org/N13-1090/
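As a quick sanity check of this analogy arithmetic, a minimal sketch using gensim's pretrained word2vec vectors; the specific model name (and the download it triggers) is an assumption, and any KeyedVectors model would work the same way:

```python
# Word-vector analogy arithmetic: King - Man + Woman ≈ Queen.
# Assumes gensim is installed; loading 'word2vec-google-news-300' downloads
# the pretrained vectors (~1.6 GB) on first use.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # a KeyedVectors instance

# most_similar() adds the 'positive' vectors, subtracts the 'negative' ones,
# and ranks vocabulary words by cosine similarity to the resulting vector.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is typically the top match.
```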
2022 Residual Stream
Toy Models of Superposition
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?
https://transformer-circuits.pub/2022/toy_model/index.html
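The qualitative point can be illustrated without training anything: give each of many sparse features its own random, nearly-orthogonal direction in a smaller space. A minimal numpy sketch, where random directions stand in for the paper's learned embedding matrix, so this only illustrates superposition rather than reproducing the toy model:

```python
# Superposition sketch: 400 sparse features stored in a 100-dimensional space.
# Random unit directions stand in for a learned feature embedding.
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 400, 100

# One direction per feature; with n_features > d_model they cannot all be
# orthogonal, so reading one feature picks up interference from the others.
W = rng.normal(size=(d_model, n_features))
W /= np.linalg.norm(W, axis=0)

# Sparse input: only a handful of features are active at once.
x = np.zeros(n_features)
active = rng.choice(n_features, size=3, replace=False)
x[active] = 1.0

h = W @ x            # superposed representation (the "activations")
readout = W.T @ h    # dot each feature direction against the activations

print("readout on active features :", np.round(readout[active], 2))
print("mean |readout| elsewhere   :", np.round(np.abs(np.delete(readout, active)).mean(), 2))
# No single dimension of h corresponds to one feature, yet each feature is still
# approximately readable along its own linear direction; in the paper's toy model
# a ReLU with a negative bias is what suppresses the residual interference.
```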
Word Embedding
The Linear Representation Hypothesis and the Geometry of Large Language Models
Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely...
https://openreview.net/forum?id=T0PoOJg8cK&noteId=hIzVPo3wws
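One simple way to operationalize "concept as a direction" is to average embedding differences over counterfactual word pairs and check that the direction transfers to a held-out pair. A rough sketch in that spirit, reusing the word2vec vectors from the earlier snippet rather than the LLM embedding/unembedding spaces the paper actually studies; the word lists are illustrative choices:

```python
# Estimate a 'male -> female' direction from counterfactual word pairs and
# test whether it transfers to a held-out pair.
import numpy as np
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

pairs = [("man", "woman"), ("king", "queen"), ("father", "mother"), ("actor", "actress")]
direction = np.mean([vectors[b] - vectors[a] for a, b in pairs], axis=0)
direction /= np.linalg.norm(direction)

# A held-out counterfactual pair should be separated along the same direction.
held_out = vectors["aunt"] - vectors["uncle"]
held_out /= np.linalg.norm(held_out)
print(float(held_out @ direction))  # clearly positive cosine similarity expected
```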
2023 Residual Stream linearity evidence
The Geometry of Categorical and Hierarchical Concepts in Large...
Understanding how semantic meaning is encoded in the representation spaces of large language models is a fundamental problem in interpretability. In this paper, we study the two foundational...
https://openreview.net/forum?id=KXuYjuBzKo
A multidimensional feature, i.e. one that lives in a subspace of more than one dimension, is not by itself sufficient to justify non-linear representations, since such a feature still occupies a linear subspace. See the sketch below.
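A synthetic illustration of that point, assuming nothing beyond numpy: a circular (two-dimensional) feature embedded in a high-dimensional space still occupies a fixed two-dimensional linear subspace, which SVD recovers.

```python
# A circular 2-D feature (e.g. a periodic quantity like day of week) embedded
# in a 768-dimensional space; 768 is an arbitrary stand-in for a hidden size.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_points = 768, 500

angles = rng.uniform(0, 2 * np.pi, n_points)
feature_2d = np.stack([np.cos(angles), np.sin(angles)], axis=1)

basis = np.linalg.qr(rng.normal(size=(d_model, 2)))[0]  # random orthonormal 2-D subspace
acts = feature_2d @ basis.T + 0.01 * rng.normal(size=(n_points, d_model))  # small noise

# Singular values of the (centered) activations: essentially all variance lies
# in two directions, i.e. the multidimensional feature is still a linear subspace.
s = np.linalg.svd(acts - acts.mean(0), compute_uv=False)
print(np.round(s[:4], 2))  # two large values, then a sharp drop
```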


Circuits Updates - July 2024
We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
https://transformer-circuits.pub/2024/july-update/index.html#linear-representations
COLM 2024 – The Geometry of Truth
LLMs represent factuality (True/False) linearly in their internal representations. Larger models exhibit a more distinct and generalizable 'truth direction'.
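A minimal sketch of a difference-of-means ("mass-mean") style truth probe in the spirit of the paper, run here on synthetic stand-in activations rather than real residual-stream activations of true/false statements; the dimensions and offsets are arbitrary assumptions:

```python
# Difference-of-means truth probe on synthetic activations.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Synthetic activations: true and false statements offset along a hidden direction.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
acts_true  = rng.normal(size=(200, d_model)) + 2.0 * true_dir
acts_false = rng.normal(size=(200, d_model)) - 2.0 * true_dir

# The probe direction is simply the difference of class means.
theta = acts_true.mean(0) - acts_false.mean(0)
midpoint = (acts_true.mean(0) + acts_false.mean(0)) / 2

# Classify held-out activations by which side of the midpoint they project to.
test = rng.normal(size=(100, d_model)) + 2.0 * true_dir  # held-out "true" statements
preds = (test - midpoint) @ theta > 0
print("accuracy on held-out true statements:", preds.mean())  # expected close to 1.0
```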

Seonglae Cho