In-context learning ability



During this Transformer Phase Change, the majority of in-context learning ability (as measured by difference in loss between tokens early and late in the sequence) is acquired, and simultaneously induction heads form within the model that are capable of implementing fairly abstract and fuzzy versions of pattern completion.
notion image

In-context learning score

The loss of the 500th token in the context minus the loss of the 50th token in the context, averaged over dataset examples.
One might wonder if the sudden increase is somehow an artifact of the choice to define in-context learning in terms of the difference between the 500th and 50th tokens. An easy way to see that this is a robust phenomenon is to look at the derivative of loss with respect to the logarithm of the token index in context. You can think of this as measuring something like "in-context learning per ε% increase in context length." We can visualize this on a 2D plot, where one axis is the amount of training that has elapsed and the other is the token index being predicted. Before the phase change, loss largely stops improving around token 50, but after the phase change, loss continues to improve past that point.
notion image
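As a rough sketch (assuming per-token losses have already been collected into a `(num_examples, seq_len)` NumPy array; the function names are illustrative, not Anthropic's code), both quantities can be computed like this:
```python
import numpy as np

def in_context_learning_score(per_token_losses: np.ndarray) -> float:
    """per_token_losses: (num_examples, seq_len) cross-entropy loss at each position.
    Loss of the 500th token minus loss of the 50th token, averaged over examples;
    more negative means more in-context learning."""
    return float((per_token_losses[:, 499] - per_token_losses[:, 49]).mean())

def loss_per_log_context(per_token_losses: np.ndarray) -> np.ndarray:
    """Derivative of the mean loss with respect to log(token index), i.e.
    'in-context learning per epsilon% increase in context length'."""
    mean_loss = per_token_losses.mean(axis=0)                 # (seq_len,)
    log_index = np.log(np.arange(1, mean_loss.shape[0] + 1))  # log token index
    return np.gradient(mean_loss, log_index)
```
Stacking the output of `loss_per_log_context` across training checkpoints gives the 2D plot described above, with training time on one axis and token index on the other.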

Prefix matching score

Anthropic go through the attention heads of a model and score them for whether they are induction heads, using a prefix-matching score that measures their ability to perform the task used to define an Induction head.

Induction head Properties

  • Prefix matching: The head attends back to previous tokens that were followed by the current and/or recent tokens. That is, it attends to the token which induction would suggest comes next.
  • Copying: The head’s output increases the logit corresponding to the attended-to token.
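A simplified sketch of such a prefix-matching score, under the assumption that the head is evaluated on a sequence made of a block of random tokens repeated twice (the setup and names are illustrative, not Anthropic's exact implementation):
```python
import numpy as np

def prefix_matching_score(attn: np.ndarray, half: int) -> float:
    """attn: (2*half, 2*half) attention pattern of a single head on a sequence
    consisting of `half` random tokens repeated twice.
    For each position i in the second repeat, an induction head should attend to
    position i - half + 1: the token that followed the earlier occurrence of the
    current token (the token induction suggests comes next)."""
    sources = np.arange(half, 2 * half)   # positions in the second repeat
    targets = sources - half + 1          # token after the previous occurrence
    return float(attn[sources, targets].mean())
```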
notion image
notion image
This already strongly suggests some connection between induction heads and in-context learning, but beyond just that, it appears this window is a pivotal point for the training process in general: whatever's occurring is visible as a bump on the training curve (figure above). It is in fact the only place in training where the loss is not convex (monotonically decreasing in slope).
That might not sound significant, but the loss curve is averaging over many thousands of tokens. Many behaviors people find interesting in language models, such as the emergence of arithmetic, would be microscopic on the loss curve.
 
 

In-context learning scoring

  • The loss at different token indices (a macro perspective focusing on the average loss, which correlates with task performance)
notion image
The main function of the Induction head, pattern matching, manifests at the behavioral level as Few shot learning.
 
 
 

Per-Token Loss Analysis

To better understand how models evolve during training, Anthropic analyze what they call "per-token loss vectors": each training snapshot is summarized by the vector of its losses on a fixed set of evaluation tokens. The core idea traces back to an earlier method and, more generally, to the idea of "function spaces" in mathematics.
notion image
Anthropic also apply principal component analysis (PCA) to the per-token losses, which lets them summarize the main dimensions of variation in how several models' predictions change over the course of training. Needing more principal components to capture that variation indicates that the predictions have become more complex and diverse; this can be interpreted as the model taking a wider range of features into account when choosing tokens, i.e., weighing more factors when it evaluates the importance of a specific token.
notion image
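A hedged sketch of this analysis (the file name and array layout below are assumptions for illustration): one per-token loss vector is collected per checkpoint, and the checkpoints are projected onto their leading principal components.
```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed layout: per_token_loss[c, t] = loss of checkpoint c on evaluation token t,
# i.e. one "per-token loss vector" per training snapshot.
per_token_loss = np.load("per_token_losses.npy")  # (n_checkpoints, n_tokens)

# Project each checkpoint onto the leading principal components to trace the
# model's trajectory through "function space" during training.
pca = PCA(n_components=2)
trajectory = pca.fit_transform(per_token_loss)    # (n_checkpoints, 2)
print(pca.explained_variance_ratio_)              # variance captured per component
```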
If a sequence of tokens occurs multiple times, the model is better at predicting the sequence the second time it shows up. On the other hand, if a token is followed by a different token than it previously was, the post-phase-change model is worse at predicting it.
notion image
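One way this can be quantified (a sketch that assumes the evaluation sequences are a random token block repeated twice; the function name is illustrative):
```python
import numpy as np

def repeat_copy_gap(per_token_losses: np.ndarray, half: int) -> float:
    """per_token_losses: (num_examples, 2*half) losses on sequences made of a
    random token block repeated twice. A model with induction heads should be
    far better on the second occurrence of the block."""
    first_pass = per_token_losses[:, 1:half].mean()  # skip position 0 (no context)
    second_pass = per_token_losses[:, half:].mean()
    return float(first_pass - second_pass)           # larger gap = better copying
```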
Anthropic also broke the loss curve apart and looked at the loss curves for individual tokens.
notion image
 
 

Gradient

The gradient-descent update applied via backpropagation when training the model and the matrix operations performed in the transformer's attention layer during inference are mathematically similar.
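A minimal NumPy sketch of this correspondence, in the spirit of the "transformers learn in-context by gradient descent" construction (the setup is illustrative, not taken from the source): one gradient-descent step on an in-context linear-regression loss yields exactly the same query prediction as a linear-attention update whose keys are the context inputs and whose values are the residual errors.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 8, 4, 2
X = rng.normal(size=(n, d_in))       # in-context inputs x_i
Y = rng.normal(size=(n, d_out))      # in-context targets y_i
W = rng.normal(size=(d_out, d_in))   # current linear model
x_q = rng.normal(size=d_in)          # query token
lr = 0.1

# (a) One gradient-descent step on L(W) = 1/2 * sum_i ||y_i - W x_i||^2,
#     then predict the query.
grad = -(Y - X @ W.T).T @ X
pred_gd = (W - lr * grad) @ x_q

# (b) The same prediction written as a linear-attention update:
#     keys = x_i, query = x_q, values = residuals (y_i - W x_i), no softmax.
values = Y - X @ W.T
pred_attn = W @ x_q + lr * values.T @ (X @ x_q)

print(np.allclose(pred_gd, pred_attn))  # True: the two computations coincide
```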
 
 
 
 

Reverse engineering with the Induction head in Multi-head Attention

Stanford CS25: V1 I Transformer Circuits, Induction Heads, In-Context Learning
"Neural network parameters can be thought of as compiled computer programs. Somehow, they encode sophisticated algorithms, capable of things no human knows how to write a computer program to do. Mechanistic interpretability seeks to reverse engineer neural networks into human understandable algorithms. Previous work has tended to focus on vision models; this talk will explore how we might reverse engineer transformer language models.  In particular, we'll focus on what we call ""induction head circuits"", a mechanism that appears to be significantly responsible for in-context learning. Using a pair of attention heads, these circuits allow models to repeat text from earlier in the context, translate text seen earlier, mimic functions from examples earlier in the context, and much more. The discovery of induction heads in the learning process appears to drive a sharp phase change, creating a bump in the loss curve, pivoting models learning trajectories, and greatly increasing their capacity for in-context learning, in the span of just a few hundred training steps." Chris Olah is a co-founder of Anthropic, an AI company focused on the safety of large models, where he leads Anthropic's interpretability efforts. Previously, Chris led OpenAI's interpretability team, and was a researcher at Google Brain. Chris' work includes the Circuits project, his blog (especially his tutorial on LSTMs), the Distill journal, and DeepDream. View the entire CS25 Transformers United playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM

Insights on pre-training and the attention mechanism

Are Emergent Abilities in Large Language Models just In-Context Learning?
 
 
 
