In-context learning ability
During this Transformer Phase Change, the majority of in-context learning ability (as measured by the difference in loss between tokens early and late in the sequence) is acquired; simultaneously, induction heads form within the model that are capable of implementing fairly abstract and fuzzy versions of pattern completion.
In-context learning score
The loss of the 500th token in the context minus the loss of the 50th token in the context, averaged over dataset examples.
One might wonder if the sudden increase is somehow an artifact of the choice to define in-context learning in terms of the difference between the 500th and 50th tokens. An easy way to see that this is a robust phenomenon is to look at the derivative of loss with respect to the logarithm of the token index in context. You can think of this as measuring something like "in-context learning per ε% increase in context length." We can visualize this on a 2D plot, where one axis is the amount of training that has elapsed and the other is the token index being predicted. Before the phase change, loss largely stops improving around token 50, but after the phase change, loss continues to improve past that point.
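As a concrete illustration, here is a minimal sketch of how both quantities could be computed from an array of per-token losses. The array name, shape, and helper functions are assumptions for illustration, not Anthropic's actual code.

```python
# Minimal sketch, assuming `losses_at_step[t]` holds the average loss at token
# index t (0-indexed) for one training checkpoint.
import numpy as np

def in_context_learning_score(losses_at_step: np.ndarray) -> float:
    """Loss of the 500th token minus loss of the 50th token."""
    return float(losses_at_step[499] - losses_at_step[49])

def loss_per_log_token(losses_at_step: np.ndarray) -> np.ndarray:
    """Derivative of loss with respect to the logarithm of the token index,
    i.e. roughly 'in-context learning per epsilon% increase in context length'."""
    token_index = np.arange(1, len(losses_at_step) + 1)
    return np.gradient(losses_at_step, np.log(token_index))

# Stacking loss_per_log_token(...) across checkpoints gives the 2D plot described
# above: one axis is elapsed training, the other is the token index being predicted.
```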
Prefix matching score
Anthropic go through the attention heads of a model and score them for whether they are induction heads, using a prefix matching score that measures their ability to perform the task used to define induction heads (a sketch of such a score follows the property list below).
Induction head Properties
- Prefix matching: The head attends back to previous tokens that were followed by the current and/or recent tokens. That is, it attends to the token which induction would suggest comes next.
- Copying: The head’s output increases the logit corresponding to the attended-to token.
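A minimal sketch of how such a prefix-matching score might be computed, assuming we already have one head's attention pattern on a sequence consisting of a block of random tokens repeated twice. The function name and setup are illustrative, not Anthropic's exact implementation.

```python
import numpy as np

def prefix_matching_score(attn: np.ndarray, block_len: int) -> float:
    """Average attention from each token in the second repeat back to the token
    that followed the same token in the first repeat (the induction target)."""
    scores = []
    for pos in range(block_len, 2 * block_len):
        induction_target = pos - block_len + 1  # token after the earlier occurrence
        scores.append(attn[pos, induction_target])
    return float(np.mean(scores))

# Toy usage: a head whose attention lands exactly on the induction target scores 1.0.
block_len = 8
attn = np.zeros((2 * block_len, 2 * block_len))
for pos in range(block_len, 2 * block_len):
    attn[pos, pos - block_len + 1] = 1.0
print(prefix_matching_score(attn, block_len))  # 1.0
```

A copying score could analogously check whether the head's output increases the logit of the attended-to token.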
This already strongly suggests some connection between induction heads and in-context learning, but beyond that, this window appears to be a pivotal point for the training process in general: whatever is occurring is visible as a bump on the training curve (figure above). It is in fact the only place in training where the loss curve is not convex (its slope is not monotonically increasing). A further interesting connection is that this figure explains Deep double descent really well.
That might not sound significant, but the loss curve is averaging over many thousands of tokens. Many behaviors people find interesting in language models, such as the emergence of arithmetic, would be microscopic on the loss curve.
In-context learning scoring
- Few-shot learning (a micro perspective focusing on specific tasks)
- The loss at different token indices (a macro perspective focusing on an average that correlates with tasks)
The main function of the induction head, pattern matching, is what few-shot learning draws on: a few-shot prompt is itself a pattern to be completed (see the toy example below).
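A toy illustration of why a few-shot prompt is an instance of the [A][B] … [A] → [B] pattern that induction heads complete. The prompt is hypothetical, not taken from the paper.

```python
# Each example line plays the role of [A][B]; the final line ends with a new [A],
# and a fuzzy induction head can attend back to earlier [A]-like positions and
# boost the logits of the tokens that followed them.
few_shot_prompt = (
    "English: cat -> French: chat\n"
    "English: dog -> French: chien\n"
    "English: house -> French:"  # the model should continue with the translation
)
print(few_shot_prompt)
```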
Per-Token Loss Analysis
To better understand how models evolve during training, Anthropic analyze what they call "per-token loss vectors." The core idea traces back to an earlier method and, more generally, to the idea of "function spaces" in mathematics.
Anthropic apply principal component analysis (PCA) to the per-token losses, which lets them summarize the main dimensions of variation in how several models' predictions vary over the course of training. If more principal components are needed, that reflects a need to capture more of the complexity and diversity of information in the dataset, and it can be interpreted as the model drawing on a wider range of features when choosing tokens; in other words, the model considers more factors when evaluating the importance of a specific token. A sketch of this analysis follows.
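A minimal sketch of the PCA step, with placeholder data standing in for the real per-token losses. The shapes and variable names are assumptions, not Anthropic's pipeline.

```python
import numpy as np

n_checkpoints, n_tokens = 50, 10_000
# per_token_losses[c, t] = loss of checkpoint c on evaluation token t
per_token_losses = np.random.rand(n_checkpoints, n_tokens)  # placeholder data

# Center over checkpoints and take the top principal components via SVD.
centered = per_token_losses - per_token_losses.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
top_k = 2
components = Vt[:top_k]               # principal directions in "token space"
trajectory = centered @ components.T  # each checkpoint projected onto them

# Plotting `trajectory` traces how the model moves through this low-dimensional
# space over training; the phase change shows up as a distinctive part of the path.
print(trajectory.shape)  # (50, 2)
```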
If a sequence of tokens occurs multiple times, the post-phase-change model is better at predicting the sequence the second time it shows up. On the other hand, if a token is followed by a different token than it previously was, the post-phase-change model is worse at predicting it.
Anthropic broke the loss curve apart and looked at the loss curves for individual tokens.
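A minimal sketch of the same kind of measurement, using GPT-2 from Hugging Face transformers as a stand-in model (the paper's models are not public) to compare a repeated sequence with one whose second half breaks the pattern.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def per_token_loss(text: str) -> torch.Tensor:
    """Cross-entropy of each predicted token against the actual next token."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    return F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

repeated = ("The quick brown fox jumps over the lazy dog. "
            "The quick brown fox jumps over the lazy dog.")
altered = ("The quick brown fox jumps over the lazy dog. "
           "The quick brown fox leaps over a sleepy cat.")

# Losses on the second occurrence of the repeated sentence should be much lower;
# the tokens that break the earlier pattern in `altered` should get higher loss.
print(per_token_loss(repeated))
print(per_token_loss(altered))
```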
Gradient
The gradient-descent update used in backpropagation when training the model and the matrix operations performed in the attention layers of a transformer language model during inference are mathematically similar.
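A minimal numeric sketch of this similarity, following the construction from the "transformers learn in-context by gradient descent" line of work rather than anything in the induction-heads paper: one gradient-descent step on in-context linear-regression examples gives the same prediction as an unnormalized linear attention read-out.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 32                      # feature dimension, number of in-context examples
X = rng.normal(size=(n, d))       # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                    # in-context targets y_i
x_q = rng.normal(size=d)          # query point

# One gradient-descent step on the squared loss, starting from w = 0:
# w_1 = eta * sum_i y_i x_i, so the prediction is eta * sum_i y_i (x_i . x_q).
eta = 0.1
w_1 = eta * X.T @ y
gd_prediction = w_1 @ x_q

# Linear attention with keys x_i, values y_i, and query x_q (no softmax):
# sum_i y_i (x_i . x_q), scaled by the same eta.
attention_prediction = eta * y @ (X @ x_q)

print(np.allclose(gd_prediction, attention_prediction))  # True
```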
Reverse engineering induction heads in multi-head attention
Insights on pre-training and the attention mechanism
Are Emergent Abilities in Large Language Models just In-Context Learning?