Repeated Token Phenomenon

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Aug 1 16:33
Editor
Edited
Edited
2025 Nov 3 18:41
Refs

Known limitation of the transformer architecture

The explanation that token repetition cycles causing divergence is related to the
Attention Sink
. The first layer fails to distinguish between the first token and repeated identical tokens, incorrectly marking repeated tokens as sinks, which creates abnormal attention and leads to repeated token divergence

Repeated token divergence

when asked to repeatedly generate the same token, the model continues for a while then diverges to different tokens. As the repetition length increases, the representation of the last repeated token converges to the representation of a single token sequence.

Divergence attack

Aligned models are designed to respond only in conversational format, so they typically don't output training data through normal methods. However, single token repetition internally becomes similar to the attention query vector of the BOS token, causing the model to misidentify it as "the start of a new document" → ignoring alignment rules and sampling from the original language model distribution. The model falls into a BOS-like subspace by itself, which differs from stably providing BOS and creates a collapse that reverts to pretrain behavior.

Attention Sink interpretation

Process of attention sink occurrence at tokens that deviate from the pattern: The first attention layer (which is considered the origin circuit and analyzed as such) begins to misidentify (mark) each "the" as if it were BoS using the BoS detection circuit. At this point, sink neurons become increasingly activated, causing the hidden-state norm to grow excessively. When saturation occurs and the circuit becomes overloaded, it produces abnormal values and starts generating nonsense as if a new context has begun.
However, not all attention heads produce sinks when repeatedly stimulated; only tokens that activate attention heads containing the BoS detection circuit can trigger this effect. For example, tokens like "the", "a", "in", and "of" are patterns frequently attended to by BoS-heads, so repeating them overstimulates the sink circuit.

Cluster Attack

Not only during generation, but the same sink activation and confusion occurs when repeated tokens are given as input. Cluster attack: Even without exact repetition, it's possible to induce the same collapse by repeatedly placing similar token sets (clusters) that trigger the same attention heads. There's a set of tokens that the same attention head frequently attends to, and when tokens within this cluster are repeatedly placed, the same attention head is repeatedly activated, eventually causing that circuit to behave like a first token mark circuit.
 
 
 

Divergence attack (2023)

Repeated Token Phenomenon
to extract
Pretraining Dataset

Even modern large LLMs like ChatGPT allow extraction of training data (including PII) through simple prompts, and current alignment and safety techniques fundamentally fail to solve the memorization problem.

Representational Collapse (2024)

Transformer compresses an input sequence of length n (each token embedding dimension) into a single fixed d-dimensional residual stream
Proves that different input sequences become nearly identical in the representation space of the last token. From a
GNN
perspective of Transformers, information from early tokens is compressed and lost through
Autoregressive Model
's path bottlenecks,
Oversquashing
. While copying tokens from the beginning is easy, copying from the end is difficult. Proves that Softmax normalization removes absolute information about sequence length, making accurate counting fundamentally impossible. Position encoding and causal masks partially mitigate this, but fundamental limitations remain.
When low-precision floating-point arithmetic is used, this phenomenon rapidly worsens, causing copying or counting tasks to fail. In Transformers, the actual root cause of the information bottleneck is the dimensional bottleneck of the representation space, not float precision, which is unrelated to oversquashing. Even with infinite floating-point precision, when degrees of freedom differ, this does not help solve oversquashing. This is a result of ignoring dimensional issues and normalization.

Cluster Attack (2025)

 
 

Recommendations