Residual Stream

Creator: Seonglae Cho
Created: 2024 Mar 31 15:22
Edited: 2024 Oct 14 11:26
The residual stream is thought of as a communication channel, since it does no processing itself and all layers communicate through it. It has a deeply linear structure and no privileged basis: we could rotate it, by rotating all the matrices that interact with it, without changing model behavior.
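The rotation argument can be checked directly on a toy layer. This is a minimal sketch with hypothetical shapes and a single MLP-style block (not any particular model): rotating the residual stream with an orthogonal matrix Q, while rotating the read and write weights to match, leaves the layer's behavior unchanged.

```python
import torch

d = 16
# toy "layer": read from the residual stream, apply a nonlinearity, write back
W_in = torch.randn(d, 4 * d)
W_out = torch.randn(4 * d, d)

def layer(resid):
    return resid + torch.relu(resid @ W_in) @ W_out

x = torch.randn(3, d)            # a small batch of residual stream vectors
y = layer(x)

# rotate the residual stream with an orthogonal matrix Q, and rotate every
# matrix that reads from or writes to it accordingly
Q, _ = torch.linalg.qr(torch.randn(d, d))
W_in_rot = Q.T @ W_in            # reads now expect the rotated stream
W_out_rot = W_out @ Q            # writes now emit into the rotated stream

def layer_rot(resid):
    return resid + torch.relu(resid @ W_in_rot) @ W_out_rot

y_rot = layer_rot(x @ Q)
print(torch.allclose(y_rot, y @ Q, atol=1e-3))  # True: behavior is unchanged
```

Note that only the residual stream is rotated; the layer's internal (computation) basis stays fixed, which is exactly why the nonlinearity does not break the equivalence.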
Once added, information persists in a subspace unless another layer actively deletes it. From this perspective, the dimensions of the residual stream act as something like "memory" or "bandwidth". The residual stream is high-dimensional and can be divided into different subspaces. Layers can interact by writing to and reading from the same or overlapping subspaces. If they write to and read from disjoint vector spaces, they will not interact; typically the spaces only partially overlap. A layer can delete information from the residual stream by reading a subspace and then writing back the negative version.
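A toy illustration of this subspace picture (axis-aligned subspaces are an assumption for simplicity; in a real model the read/write subspaces are arbitrary directions in the stream):

```python
import torch

d = 8
resid = torch.zeros(d)

# Layer A writes a message into dims 0..2 (its writing subspace).
write_A = torch.zeros(d)
write_A[:3] = torch.tensor([1.0, 2.0, -1.0])
resid = resid + write_A

read_B = resid[2:5]   # layer B reads an overlapping subspace -> sees part of A's message
read_C = resid[5:8]   # layer C reads a disjoint subspace    -> sees nothing from A

# A later layer can delete A's message by reading it and writing back the negative.
resid = resid - write_A
print(resid)          # back to zeros: the "memory" has been freed
```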

Bottleneck activation

We say that an activation is a bottleneck activation if it is a lower-dimensional intermediate between two higher-dimensional activations. For example, the residual stream is a bottleneck activation because it is the only way to pass information between MLP activations, which are typically four times larger than it.
For example, at layer 25 of a 50-layer transformer, the residual stream has 100 times as many neurons writing into it as it has dimensions (25 preceding layers, each with an MLP four times wider than the stream), and it must communicate with 100 times as many neurons after it, somehow communicating in superposition (see the Superposition Hypothesis)! We call tensors like this bottleneck activations. Similarly, a value vector is a bottleneck activation because it is much lower-dimensional than the residual stream, and it is the only way to move information from the residual stream at one token position in the context to another.
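The bandwidth arithmetic behind that claim, spelled out (d_model is a hypothetical value; the 4x MLP width is the common convention assumed here):

```python
d_model = 1024
layers_before = 25                      # layer 25 of a 50-layer transformer
mlp_neurons_per_layer = 4 * d_model     # standard MLP expansion factor

neurons_writing_in = layers_before * mlp_neurons_per_layer
print(neurons_writing_in // d_model)    # 100 -> 100x more neurons than dimensions
```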
Perhaps because of this high demand on residual stream bandwidth, we've seen hints that some MLP neurons and attention heads may perform a kind of "memory management" role, clearing residual stream dimensions set by other layers by reading in information and writing out the negative version. (This is one theory.)

"The capital of France is" → attention/FFN subvalues activate capitals/cities

Here, "subvalue" refers to an activation, not a weight.
  1. The distribution change of residual connections in vocabulary space is caused by directly adding subvalues to the before-softmax values.
  2. The log-probability increase, used as a contribution score, can help locate important subvalues (see the sketch below).
  3. Attention/FFN subvalues in previous layers act as direct "queries" that activate upper FFN subvalues via inner products.
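A minimal sketch of the contribution-score idea (all shapes and names here are hypothetical, not taken from a specific codebase): project through the unembedding and score a single subvalue by how much directly adding it raises the log probability of a target token.

```python
import torch

d_model, vocab = 512, 1000
W_U = torch.randn(d_model, vocab)      # unembedding matrix (assumed)
resid = torch.randn(d_model)           # residual stream without the subvalue
subvalue = torch.randn(d_model)        # one attention/FFN subvalue: an activation, not a weight
target = 42                            # token whose probability we track

def log_prob(x, token):
    # project to vocabulary space and take the log probability of one token
    return torch.log_softmax(x @ W_U, dim=-1)[token]

# contribution score: log-probability increase from directly adding the subvalue
score = log_prob(resid + subvalue, target) - log_prob(resid, target)
print(score.item())                    # positive -> the subvalue promotes the target token
```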
 

Virtual weights

These virtual weights are the product of the output weights of one layer with the input weights of a later layer, which is possible because the residual stream between them is linear.
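A minimal sketch with toy shapes: the effective ("virtual") connection from an earlier layer's neurons to a later layer's neurons is just the product of the write and read matrices.

```python
import torch

d_model, d_hidden = 512, 2048
W_out_early = torch.randn(d_hidden, d_model)   # earlier layer writes to the stream
W_in_late = torch.randn(d_model, d_hidden)     # later layer reads from the stream

W_virtual = W_out_early @ W_in_late            # (d_hidden, d_hidden) virtual weights
# W_virtual[i, j]: how strongly earlier neuron i drives later neuron j
```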
 
 

Privileged basis

In principle, the residual stream's basis directions should be "arbitrary": no more likely to encode information than random directions, since the transformer's design does not favor specific dimensions for encoding information. In practice, recent work has shown this assumption to be false: some dimensions show higher importance or activity than expected, appearing more privileged than other dimensions. Such findings contradict the assumed uniformity of how information is distributed within the model.
Optimizers like the Adam Optimizer, which maintain per-dimension momentum and normalization statistics, can make certain dimensions learn faster than others.
When Anthropic trained a Transformer with a different basis for the residual stream than for the computation inside each layer, they observed heavy-tailed activations in the computation basis, but not in the residual stream.
Anthropic also explored two other obvious sources of basis dependency in a Transformer, layer normalization and finite-precision floating-point calculations, and confidently ruled both out as the source of the observed basis alignment.
Nonetheless, Transformers do not rely on a privileged basis to train and function properly, even when floating-point precision is taken into account.
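One simple way to probe for a privileged basis (a sketch, not Anthropic's exact methodology): per-dimension excess kurtosis of activations. A rotationally symmetric stream should look roughly Gaussian in every basis direction, while a few heavy-tailed "outlier" dimensions suggest the basis is privileged.

```python
import torch

acts = torch.randn(10_000, 512)        # stand-in for collected residual stream activations

z = (acts - acts.mean(dim=0)) / acts.std(dim=0)
excess_kurtosis = (z ** 4).mean(dim=0) - 3.0

print(excess_kurtosis.abs().max())     # large values in a few dims would indicate privilege
```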
 

Viewer

Very briefly, the tool lets you see the dot product of the residual stream at each token with a particular direction.
https://tinyurl.com/resid-viewer
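What the viewer computes, in essence (toy shapes and a hypothetical direction, not the tool's actual code):

```python
import torch

n_tokens, d_model = 12, 512
resid = torch.randn(n_tokens, d_model)   # residual stream across the context
direction = torch.randn(d_model)         # direction of interest (e.g. a probe or feature)

scores = resid @ direction               # one scalar per token
print(scores)
```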
 
 
 
 
 
