Autoregressive Model


Abstractly, an autoregressive model is a Non-Deterministic Turing Machine with probabilities attached to its transitions.

Autoregression is a statistical technique that uses past values to predict future values in a time series. It's a regression of a variable against itself.
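As a minimal formalization (standard definitions, stated here for context rather than taken from this page), an AR(p) model regresses the current value on its own p previous values, and an autoregressive language model factorizes the sequence distribution the same way, one token at a time:

```latex
% AR(p): the next value is a linear combination of the previous p values plus noise
x_t = c + \sum_{i=1}^{p} \phi_i \, x_{t-i} + \varepsilon_t
% Autoregressive LM: the joint distribution factorizes token by token,
% each token conditioned only on the tokens before it
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})
```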
A point of terminological confusion: Masked Attention is considered the opposite of a Bidirectional LM because information about tokens after the current token is not used. The key point is unidirectionality versus bidirectionality.
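A minimal NumPy sketch of that distinction (toy shapes and variable names are assumptions, not taken from any particular library): the only difference between causal (autoregressive) attention and bidirectional attention is the mask applied to the score matrix before Softmax.

```python
import numpy as np

def attention(q, k, v, causal: bool):
    """Single-head attention; q, k, v have shape (T, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (T, T) attention logits
    if causal:
        # Unidirectional: position t may only attend to positions <= t.
        mask = np.triu(np.ones_like(scores, dtype=bool), 1)
        scores = np.where(mask, -np.inf, scores)
    # Bidirectional: no mask, every token sees the whole sequence.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

T, d = 5, 8
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(T, d))
causal_out = attention(q, k, v, causal=True)          # row t mixes only tokens <= t
bidirectional_out = attention(q, k, v, causal=False)  # row t mixes all tokens
```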
Autoregressive Model Notion
Decoder-only LLM to Encoder-Decoder Transformer
Injective function to the final hidden state

LLM

Representational Collapse (2024) - Transformers need glasses!

The Transformer compresses an input sequence of length n (each token a d-dimensional embedding) into a single fixed d-dimensional residual stream at the last position.
Proves that different input sequences become nearly identical in the representation space of the last token. From a GNN perspective on Transformers, information from early tokens is compressed and lost through the Autoregressive Model's path bottlenecks (Oversquashing). While copying tokens from the beginning of the sequence is easy, copying from the end is difficult. Also proves that Softmax normalization removes absolute information about sequence length, making accurate counting fundamentally impossible. Position encodings and causal masks partially mitigate this, but the fundamental limitations remain.
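A toy demo of the counting claim (my own sketch, not the paper's code): because Softmax weights sum to 1, attending uniformly over n identical tokens produces the same readout for every n, so the representation carries no absolute information about n.

```python
import numpy as np

v = np.array([1.0, -2.0, 0.5])                        # one value vector, repeated n times
for n in (4, 64, 4096):
    logits = np.zeros(n)                              # identical keys -> identical scores
    weights = np.exp(logits) / np.exp(logits).sum()   # uniform weights summing to 1
    out = weights @ np.tile(v, (n, 1))                # attention readout over n copies
    print(n, out)                                     # same vector (up to rounding) for every n
```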
When low-precision floating-point arithmetic is used, this phenomenon worsens rapidly, causing copying and counting tasks to fail. In Transformers, however, the actual root cause of the information bottleneck is the dimensional bottleneck of the representation space, not float precision, which by itself is unrelated to oversquashing. Even with infinite floating-point precision, the mismatch in degrees of freedom remains, so more precision does not solve oversquashing; blaming precision is what results from ignoring the dimensionality issue and the normalization.
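A hedged sketch of the precision effect (a toy stand-in, not the paper's construction): two readouts that differ by roughly 1/n in exact arithmetic become bit-identical once rounded to float16, so a downstream layer can no longer tell the two sequences apart.

```python
import numpy as np

for n in (10, 1000, 1_000_000):
    a = np.float32(1.0)               # toy readout for one sequence
    b = np.float32(1.0 + 1.0 / n)     # toy readout for a sequence differing by ~1/n
    a16, b16 = np.float16(a), np.float16(b)
    print(n, float(b - a), bool(a16 == b16))  # the gap shrinks; in float16 it vanishes
```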
When the causal mask is taken as the graph structure over the input sequence and the DAG defining information flow between input tokens is drawn, information sinks toward the last token. Analyzing the information flow of such feedforward DAGs from a graph-theory perspective, the paper argues that there exist better graph structures than the fully-connected one.
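A small sketch of that DAG view (an assumed toy setup, not from the paper's code): treat each (position, layer) pair as a node with a causal edge from position i at layer l to every position j >= i at layer l + 1, and count the paths from each input position to the last token's top-layer node. The counts make the skew of the topology explicit.

```python
def causal_path_counts(n_tokens: int, n_layers: int) -> list[int]:
    """Number of distinct paths from each input position (layer 0)
    to the last position at the top layer, under a causal mask."""
    paths = [0] * n_tokens
    paths[-1] = 1                     # top layer: only the last position is the target
    for _ in range(n_layers):
        # One layer down: node i reaches every node j >= i in the layer above,
        # so its path count is the suffix sum of the layer above.
        suffix, new_paths = 0, [0] * n_tokens
        for i in range(n_tokens - 1, -1, -1):
            suffix += paths[i]
            new_paths[i] = suffix
        paths = new_paths
    return paths

print(causal_path_counts(n_tokens=6, n_layers=3))
# [21, 15, 10, 6, 3, 1]: the number of paths to the final node differs sharply
# across positions, so information flow through the causal DAG is heavily skewed.
```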
Defining two criteria, mixing time and fidelity, the paper shows that the FS-graph found by FunSearch is sparser than the fully-connected graph while still transmitting information better.
