Known limitation of the transformer architecture
One explanation for why repeated-token cycles cause divergence relates to the Attention Sink: the first layer fails to distinguish repeated identical tokens from the first token and incorrectly marks the repeated tokens as sinks, which creates abnormal attention and leads to repeated-token divergence.
Repeated token divergence
When asked to generate the same token repeatedly, the model continues for a while and then diverges to different tokens. As the repetition length increases, the representation of the last repeated token converges to the representation of a single-token sequence.
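A minimal measurement sketch of this convergence, assuming a HuggingFace transformers environment and using GPT-2 purely as a stand-in model: it compares the final-layer state of the last repeated token with that of a single-token sequence as the repetition grows.

```python
# Sketch: measure how the last-token representation of "the the ... the"
# approaches the representation of a single "the" as repetition grows.
# GPT-2 is only a stand-in model; any causal LM from transformers works.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def last_hidden(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids)
    return out.last_hidden_state[0, -1]  # final-layer state of the last token

single = last_hidden("the")
for n in (2, 8, 32, 128, 512):
    repeated = last_hidden(" ".join(["the"] * n))
    sim = torch.cosine_similarity(single, repeated, dim=0).item()
    print(f"n={n:4d}  cos(last repeated token, single token) = {sim:.4f}")
```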
Divergence attack
Aligned models are trained to respond only in a conversational format, so they normally do not emit training data through ordinary prompting. However, repeating a single token produces internal states similar to the attention query vector of the BOS token, causing the model to misidentify the position as "the start of a new document" → it ignores the alignment rules and samples from the original language-model distribution. The model falls into this BOS-like subspace on its own, which differs from being stably given a BOS token, and the resulting collapse reverts it to pretraining behavior.
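A rough sketch of the attack mechanics (not an actual extraction), assuming transformers and using GPT-2 as a stand-in for an aligned chat model; the repeated word "poem" and the sampling settings are illustrative choices, not the original setup:

```python
# Sketch of the divergence attack: feed a long single-token repetition and
# sample a continuation, checking when the model stops repeating the token.
# GPT-2 stands in for an aligned chat model here, so this only illustrates
# the mechanics, not an actual extraction from a production system.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = " ".join(["poem"] * 200)          # "poem poem poem ..." repetition
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=100,
        do_sample=True,
        top_k=50,
        pad_token_id=tok.eos_token_id,
    )

continuation = tok.decode(out[0, ids.shape[1]:])
print(continuation)  # after enough repetitions the output drifts off "poem"
```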
Attention Sink interpretation
How an attention sink arises at tokens that deviate from the pattern: the first attention layer (treated here as the originating circuit and analyzed as such) begins to mark each repeated "the" as if it were BOS, via its BOS-detection circuit. Sink neurons then become increasingly activated, and the hidden-state norm grows excessively. Once the circuit saturates and overloads, it produces abnormal values and the model starts generating nonsense as if a new context had begun.
However, not all attention heads produce sinks under repeated stimulation; only tokens that activate the attention heads containing the BOS-detection circuit can trigger this effect. For example, tokens like "the", "a", "in", and "of" are patterns frequently attended to by BOS-heads, so repeating them overstimulates the sink circuit.
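A small probe of this sink behavior, assuming transformers with GPT-2 as a stand-in: it checks how much first-layer attention the repeated tokens draw from the last position and how large their hidden-state norms become.

```python
# Sketch: probe sink behaviour on a repeated-token prompt. For each position
# we record (a) how much attention layer-0 heads pay to it from the final
# query and (b) the norm of its final-layer hidden state, to see whether
# repeated "the" tokens attract sink-like attention and grow large norms.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

prompt = " ".join(["the"] * 64)
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_attentions=True, output_hidden_states=True)

# Layer-0 attention has shape (batch, heads, query, key); take the last query row.
attn_from_last = out.attentions[0][0, :, -1, :].mean(dim=0)   # mean over heads
norms = out.hidden_states[-1][0].norm(dim=-1)                  # final-layer norms

for pos in (0, 1, 15, 31, 63):
    print(f"pos={pos:2d}  attn_from_last={attn_from_last[pos]:.3f}  "
          f"hidden_norm={norms[pos]:.1f}")
```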
Cluster Attack
The same sink activation and confusion occurs not only during generation but also when repeated tokens are given as input. Cluster attack: even without exact repetition, the same collapse can be induced by repeatedly placing similar tokens (a cluster) that trigger the same attention heads. Each such head has a set of tokens it frequently attends to; when tokens from this cluster are placed repeatedly, the head is activated over and over, and the circuit eventually behaves like a first-token-marking circuit.
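A hedged sketch of what a cluster-attack prompt could look like, cycling through the function-word cluster mentioned above instead of repeating one token exactly; the sink metric, the control prompt, and the GPT-2 model are illustrative assumptions, not the original setup:

```python
# Sketch of a cluster-attack prompt: instead of one exactly repeated token,
# cycle through a small cluster of tokens said to excite the same BOS-heads
# ("the", "a", "in", "of") and compare layer-0 attention to the first token
# against a control prompt of ordinary varied text. GPT-2 is a stand-in.
import itertools
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def sink_attention(text: str) -> float:
    """Mean layer-0 attention from the last query position to position 0."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    return out.attentions[0][0, :, -1, 0].mean().item()

cluster = ["the", "a", "in", "of"]
cluster_prompt = " ".join(itertools.islice(itertools.cycle(cluster), 128))
control_prompt = "The quick brown fox jumps over the lazy dog near the river. " * 8

print("cluster prompt sink attention:", sink_attention(cluster_prompt))
print("control prompt sink attention:", sink_attention(control_prompt))
```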
Divergence attack (2023)
Repeated Token Phenomenon to extract the Pretraining Dataset
Even modern large LLMs like ChatGPT allow extraction of training data (including PII) through simple prompts, and current alignment and safety techniques fundamentally fail to solve the memorization problem.
Representational Collapse (2024) - Transformers need glasses!
A Transformer compresses an input sequence of n tokens (each a d-dimensional embedding) into a single fixed d-dimensional residual stream.
The paper proves that different input sequences can become nearly identical in the representation space of the last token. Viewing the Transformer as a GNN, information from early tokens is compressed and lost through the autoregressive model's path bottlenecks (over-squashing). Copying tokens from the beginning of a sequence is easy, while copying from the end is difficult. It also proves that softmax normalization removes absolute information about sequence length, making exact counting fundamentally impossible. Positional encoding and the causal mask partially mitigate these effects, but the fundamental limitations remain.
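A quick sketch of the collapse claim, assuming transformers with GPT-2 (float32) as a stand-in: it measures the distance between the last-token states of n versus n+1 repeated "1" tokens, which a model would need to keep distinguishable in order to count exactly.

```python
# Sketch of representational collapse: compare the final-token hidden state of
# a sequence of n ones with that of n+1 ones. If the model were to count the
# ones exactly, these states would need to stay distinguishable; in practice
# their distance shrinks as n grows. GPT-2 in float32 is just a stand-in, so
# this only illustrates the trend, not the paper's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def last_state(n_ones: int) -> torch.Tensor:
    ids = tok(" ".join(["1"] * n_ones), return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids).last_hidden_state[0, -1]

for n in (4, 16, 64, 256):
    d = torch.dist(last_state(n), last_state(n + 1)).item()
    print(f"n={n:4d}  ||h_n - h_(n+1)|| = {d:.4f}")
```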
With low-precision floating-point arithmetic this phenomenon worsens rapidly, causing copying and counting tasks to fail. However, the actual root cause of the information bottleneck in Transformers is the dimensional bottleneck of the representation space, not float precision, which by itself is unrelated to over-squashing: even with infinite floating-point precision, the mismatch in degrees of freedom means over-squashing is not solved. Treating it as a precision problem alone ignores the dimensionality and normalization issues.
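A toy arithmetic illustration (not the paper's proof) of why low precision makes the collapse exact: once a uniform attention weight of 1/n falls below half a float16 ULP at 1.0, adding one more token's contribution can no longer change the accumulated value.

```python
# Toy float16 check: a uniform attention weight of 1/n cannot perturb an
# accumulated value of 1.0 once 1/n drops below half a ULP (~2**-11), so in
# half precision the contribution of one extra repeated token can vanish
# entirely. This is only an arithmetic illustration of the argument.
import torch

one = torch.tensor(1.0, dtype=torch.float16)
for n in (256, 1024, 2048, 4096):
    delta = torch.tensor(1.0 / n, dtype=torch.float16)
    unchanged = bool(one + delta == one)
    print(f"n={n:5d}  1/n={float(delta):.6f}  (1.0 + 1/n == 1.0) -> {unchanged}")
```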

Seonglae Cho