Transformer Phase Change

Creator
Seonglae Cho
Created
2024 Apr 18 8:58
Edited
2024 Apr 21 15:01
Refs
In particular, the phase change we observe forms an interesting potential bridge between the microscopic domain of interpretability and the macroscopic domain of scaling laws and learning dynamics.
notion image
If we make it sufficiently sparse, there is a phase change, and the geometry collapses from a pentagon to a pair of digons, with the sparser point at zero. The phase change corresponds to the loss curves of the two different geometries crossing over.
A more complicated form of non-uniform superposition occurs when there are correlations between features. This seems essential for understanding superposition in the real world, where many features are correlated or anti-correlated.
notion image
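For concreteness, here is a minimal sketch (in PyTorch) of the kind of toy model this refers to: sparse features compressed through a small linear bottleneck and reconstructed with a ReLU. The feature count, sparsity level, and importance decay below are illustrative assumptions, not the exact settings from the paper; sweeping the sparsity is what exposes the phase change described above.

```python
import torch

# Toy model of superposition (sketch): n_features sparse features are
# projected into n_hidden < n_features dimensions by W, then reconstructed
# as ReLU(h @ W + b). All hyperparameters here are illustrative.
n_features, n_hidden = 5, 2
sparsity = 0.9                                   # probability a feature is zero
importance = 0.9 ** torch.arange(n_features)     # decaying feature importance

W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Sparse inputs: each feature is 0 with prob `sparsity`, else uniform in [0, 1]
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity)
    h = x @ W.T                                  # project into the bottleneck
    x_hat = torch.relu(h @ W + b)                # reconstruct with the same weights
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# At low sparsity the columns of W stay roughly orthogonal; at high sparsity
# they arrange into superposition geometries (e.g. a pentagon in 2D).
print(W.detach())
```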

Phase change for
In-context learning

Induction heads may be the mechanistic source of general in-context learning in transformer models of any size.
A phase change occurs early in training for language models of every size (provided they have more than one layer), and it is visible as a bump in the training loss. During this phase change, the majority of in-context learning ability (as measured by the difference in loss between tokens early and late in the sequence) is acquired, and simultaneously induction heads form within the model that are capable of implementing fairly abstract and fuzzy versions of pattern completion.
 

What drives the Phase change

First, the window where the phase change happens doesn’t appear to correspond to a scheduled change in learning rate, warmup, or weight decay; there is not some known exogenous factor precipitating everything. Second, we tried out training some of the small models on a different dataset, and we observed the phase change develop in the same way.
 
 

AI Feature Dimensionality

Is there a way we could understand what "fraction of a dimension" a specific feature gets?
notion image
Perhaps the most striking phenomenon Anthropic have noticed is that the learning dynamics of toy models with large numbers of features appear to be dominated by "energy level jumps" where features jump between different feature dimensionalities.
notion image
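One way to make "fraction of a dimension" concrete is the feature dimensionality from the toy-models work, D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2. The sketch below computes it from a weight matrix; the expression is reproduced from memory, so treat the exact form as an assumption.

```python
import torch

def feature_dimensionality(W: torch.Tensor) -> torch.Tensor:
    """Fraction of a hidden dimension each feature gets.

    W has shape (n_hidden, n_features); column W_i is feature i's embedding.
    D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2, with W_i_hat = W_i / ||W_i||.
    A feature with its own orthogonal direction gets D_i = 1; features that
    share directions get a fraction (e.g. 2/5 each for a pentagon in 2D).
    """
    norms = W.norm(dim=0)                    # ||W_i|| per column
    W_hat = W / norms.clamp(min=1e-8)        # unit-norm columns
    overlaps = (W_hat.T @ W) ** 2            # (W_i_hat . W_j)^2
    return norms ** 2 / overlaps.sum(dim=1)
```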
 
 

In-context learning ability

During this
Transformer Phase Change
, the majority of in-context learning ability (as measured by difference in loss between tokens early and late in the sequence) is acquired, and simultaneously induction heads form within the model that are capable of implementing fairly abstract and fuzzy versions of pattern completion.
notion image

In-context learning score

The loss of the 500th token in the context minus the average loss of the 50th token in the context, averaged over dataset examples.
One might wonder if the sudden increase is somehow an artifact of the choice to define in-context learning in terms of the difference between the 500th and 50th tokens. An easy way to see that this is a robust phenomenon is to look at the derivative of loss with respect to the logarithm of the token index in context. You can think of this as measuring something like "in-context learning per ε% increase in context length." We can visualize this on a 2D plot, where one axis is the amount of training that has elapsed and the other is the token index being predicted. Before the phase change, loss largely stops improving around token 50, but after the phase change, loss continues to improve past that point.
notion image
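As a rough sketch, both views can be computed from a matrix of per-token losses; how you obtain that matrix from your model, and the helper names below, are assumptions.

```python
import numpy as np

def in_context_learning_score(token_losses: np.ndarray) -> float:
    """Loss at the 500th token minus loss at the 50th token, averaged over
    dataset examples (more negative = more in-context learning).

    token_losses: array of shape (n_examples, n_tokens) with per-token loss.
    """
    return float((token_losses[:, 499] - token_losses[:, 49]).mean())

def loss_per_log_token(token_losses: np.ndarray) -> np.ndarray:
    """Derivative of mean loss with respect to log(token index): the
    'in-context learning per eps% increase in context length' view."""
    mean_loss = token_losses.mean(axis=0)            # loss vs. token index
    log_index = np.log(np.arange(1, mean_loss.size + 1))
    return np.gradient(mean_loss, log_index)
```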

Prefix matching score

Anthropic go through the attention heads of a model and score them for whether they are induction heads, using a prefix matching score that measures their ability to perform the task used to define
Induction head

Induction head Properties

  • Prefix matching: The head attends back to previous tokens that were followed by the current and/or recent tokens. That is, it attends to the token which induction would suggest comes next.
  • Copying: The head’s output increases the logit corresponding to the attended-to token.
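A rough sketch of such a prefix matching score, given one head's attention pattern and the token ids; Anthropic evaluate this on sequences of repeated random tokens, and the function name and exact weighting here are assumptions.

```python
import numpy as np

def prefix_matching_score(attn: np.ndarray, tokens: np.ndarray) -> float:
    """Average attention mass a head places on 'induction' positions.

    attn:   (seq_len, seq_len) attention pattern for one head, where
            attn[t, s] is how much destination t attends to source s.
    tokens: (seq_len,) token ids for the sequence.

    For each destination t, the induction targets are sources s whose
    preceding token equals the current token: tokens[s - 1] == tokens[t],
    i.e. the token that induction would suggest comes next.
    """
    seq_len = len(tokens)
    scores = []
    for t in range(1, seq_len):
        targets = [s for s in range(1, t + 1) if tokens[s - 1] == tokens[t]]
        if targets:
            scores.append(attn[t, targets].sum())
    return float(np.mean(scores)) if scores else 0.0
```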
notion image
notion image
This already strongly suggests some connection between induction heads and in-context learning, but beyond just that, it appears this window is a pivotal point for the training process in general: whatever's occurring is visible as a bump on the training curve (figure above). It is in fact the only place in training where the loss curve is not convex (i.e., where improvement does not simply slow down smoothly). The more interesting connection is that this figure explains
Deep double descent
really well.
That might not sound significant, but the loss curve is averaging over many thousands of tokens. Many behaviors people find interesting in language models, such as the emergence of arithmetic, would be microscopic on the loss curve.
 
 
 
 
 

Reverse engineering with Induction head in
Multi-head Attention

 
 

Recommendations