Successor Head

Language models have been confirmed to contain specialized “Successor Heads”: attention heads that map a token from an ordinal sequence (e.g., numbers or days of the week) to the next token in that sequence.

Analysis with four complementary methods, including ICA (Independent Component Analysis) and weight inspection, indicates that these heads predict the next ordinal token with about 80% accuracy.
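To make the "next ordinal token" task concrete, here is a minimal sketch of how successor accuracy could be scored. It is not the method used in the paper: the task set `ORDINAL_SEQUENCES`, the helper names, and the lookup-table predictor standing in for a model head are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical task set: a few sequences with a natural ordering.
ORDINAL_SEQUENCES: Dict[str, List[str]] = {
    "days": ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"],
    "months": ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December"],
    "numbers": [str(n) for n in range(1, 11)],
}


def successor_pairs(sequences: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Collect (token, next_token) for every adjacent pair in each sequence."""
    pairs: List[Tuple[str, str]] = []
    for tokens in sequences.values():
        pairs.extend(zip(tokens, tokens[1:]))
    return pairs


def successor_accuracy(predict: Callable[[str], str],
                       pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs for which predict(token) equals the true successor."""
    correct = sum(1 for token, nxt in pairs if predict(token) == nxt)
    return correct / len(pairs)


PAIRS = successor_pairs(ORDINAL_SEQUENCES)

# Stand-in for a model head (assumption): a perfect lookup table.
# A real evaluation would read the prediction from the model instead.
_SUCCESSOR_TABLE = dict(PAIRS)


def lookup_predictor(token: str) -> str:
    """Return the successor from the table, or the token itself if unknown."""
    return _SUCCESSOR_TABLE.get(token, token)


if __name__ == "__main__":
    print(f"pairs evaluated: {len(PAIRS)}")
    print(f"lookup accuracy: {successor_accuracy(lookup_predictor, PAIRS):.2f}")
```

Swapping `lookup_predictor` for a function that queries an actual attention head would give a figure comparable in spirit to the roughly 80% reported above, although the exact evaluation protocol behind that number is not specified here.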

Successor Heads: Recurring, Interpretable Attention Heads In The Wild

In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment 'Monday' into...

https://arxiv.org/abs/2312.09230

Circuits Updates - September 2024

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

https://transformer-circuits.pub/2024/september-update/index.html
