Abstract Non-Deterministic Turing Machine with probability
Autoregression is a statistical technique that uses past values to predict future values in a time series. It is a regression of a variable against its own past values.
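A minimal sketch of the idea: regress a series on its own lagged values and predict one step ahead (the lag order p and the toy AR(1) series are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.zeros(200)
for t in range(1, 200):
    # toy AR(1) process: x_t = 0.8 * x_{t-1} + noise
    x[t] = 0.8 * x[t - 1] + rng.normal(scale=0.1)

p = 2  # lag order (assumed for illustration)
X = np.column_stack([x[p - 1 - k : len(x) - 1 - k] for k in range(p)])  # columns: x_{t-1}, x_{t-2}
y = x[p:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimate of the AR coefficients
next_val = coef @ x[-1:-p - 1:-1]              # one-step-ahead prediction from the last p values
print(coef, next_val)
```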
A confusing piece of terminology: Masked (causal) Attention is considered the opposite of a Bidirectional LM because information about tokens after the current token is not used. The key distinction is unidirectionality vs. bidirectionality.
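The distinction in mask form, as a toy sketch (sequence length and naming are illustrative):

```python
import numpy as np

n = 5
bidirectional = np.ones((n, n), dtype=bool)    # every position attends to every position (bidirectional LM, BERT-style)
causal = np.tril(np.ones((n, n), dtype=bool))  # masked attention: position t only sees positions <= t (GPT-style)
print(causal.astype(int))
```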
Autoregressive Model Notion
Decoder-only LLM to Encoder-Decoder Transformer
Injective function to final hidden state
LLM
Representational Collapse (2024) - Transformers need glasses!

A Transformer compresses an input sequence of n tokens (each a d-dimensional embedding) into a single fixed d-dimensional residual stream at the final position.
The paper proves that different input sequences can become nearly indistinguishable in the representation space of the last token. From a GNN perspective on Transformers, information is compressed and lost as it flows through the autoregressive path bottlenecks (over-squashing): copying tokens from the beginning of the sequence is easy, but copying from the end is difficult. It also proves that softmax normalization removes absolute information about sequence length, making exact counting fundamentally impossible. Positional encoding and causal masks partially mitigate this, but the fundamental limitation remains.
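A toy single-head attention calculation illustrating the collapse argument (the random embeddings and uniform attention scores are simplifying assumptions, not the paper's exact construction): because softmax weights always sum to 1, appending one extra token shifts the last-token output by only ~1/n, so "1 1 … 1" and "1 1 … 1 0" become indistinguishable as n grows.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
emb = {0: rng.normal(size=d), 1: rng.normal(size=d)}  # hypothetical token embeddings

def last_token_attention(tokens):
    V = np.stack([emb[t] for t in tokens])   # values
    scores = np.zeros(len(tokens))           # identical scores -> uniform softmax weights
    w = np.exp(scores) / np.exp(scores).sum()
    return w @ V                             # attention output at the last position

for n in (10, 100, 1000, 10000):
    a = last_token_attention([1] * n)        # "1 ... 1"
    b = last_token_attention([1] * n + [0])  # "1 ... 1 0"
    print(n, np.linalg.norm(a - b))          # gap shrinks roughly like 1/n
```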
With low-precision floating-point arithmetic this phenomenon worsens rapidly, causing copying and counting tasks to fail. In Transformers, however, the real root cause of the information bottleneck is the dimensional bottleneck of the representation space, not float precision, which is unrelated to over-squashing. Even with infinite floating-point precision, the mismatch in degrees of freedom remains, so higher precision does not solve over-squashing; blaming precision ignores the dimensionality and normalization issues.
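A small float16 sketch of why low precision makes the collapse show up at much shorter lengths (here 1/n is a stand-in for one token's share of the softmax mass, assumed for illustration):

```python
import numpy as np

for n in (10, 1000, 10000):
    base = np.float16(1.0)
    delta = np.float16(1.0 / n)       # rough size of one extra token's contribution
    print(n, base + delta == base)    # becomes True once 1/n falls below float16 resolution near 1.0
```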
When the input sequence's graph structure is given by a causal mask and we draw the DAG defining the information flow between tokens, information sinks into the last token. Analyzing the information flow of such feedforward DAGs from a graph-theoretic perspective, the paper argues that there exist better graph structures than the fully-connected one.
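A quick path-counting sketch of that DAG view (sequence length and layer count are arbitrary): with a causal mask, position j can aggregate from every position i ≤ j in each layer, so all paths terminate at the final token and earlier positions accumulate far more of them.

```python
import numpy as np

n, layers = 6, 3
F = np.triu(np.ones((n, n), dtype=np.int64))          # F[i, j] = 1: information can flow i -> j within one layer
paths_to_last = np.linalg.matrix_power(F, layers)[:, -1]
print(paths_to_last)                                  # earlier positions have many more paths into the final token
```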

Defining two criteria, mixing time and fidelity, the paper shows that the FS-graph found by FunSearch is sparser than the fully-connected graph while still transmitting information better.
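A hedged sketch of one way to proxy mixing time: the spectral gap of the random-walk matrix on a candidate attention graph (a larger gap roughly means faster mixing). The ring-plus-skip graph below is a hypothetical stand-in, not the actual FS-graph found by FunSearch.

```python
import numpy as np

def spectral_gap(adj):
    P = adj / adj.sum(axis=1, keepdims=True)            # row-normalized random-walk matrix
    eig = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - eig[1]                                  # gap between the two largest eigenvalue magnitudes

n = 16
full = np.ones((n, n))                                   # fully-connected attention graph (with self-loops)
sparse = np.eye(n)
for i in range(n):
    for j in (i - 1, i + 1, i + n // 2):                 # ring edges plus one long-range skip per node
        sparse[i, j % n] = 1.0

print("edges:", int(full.sum()), "vs", int(sparse.sum()))
print("fully-connected gap:", round(spectral_gap(full), 3))
print("sparse graph gap:", round(spectral_gap(sparse), 3))
```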

Seonglae Cho