Abstract Non-Deterministic Turing Machine with probability
Autoregression is a statistical technique that uses past values to predict future values in a time series. It is a regression of a variable against its own past values.
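A minimal sketch of the idea: regress a series on its own lagged values and predict one step ahead (the lag order p and the toy AR(1) series are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.zeros(200)
for t in range(1, 200):
    # toy AR(1) process: x_t = 0.8 * x_{t-1} + noise
    x[t] = 0.8 * x[t - 1] + rng.normal(scale=0.1)

p = 2  # lag order (assumed for illustration)
X = np.column_stack([x[p - 1 - k : len(x) - 1 - k] for k in range(p)])  # columns: x_{t-1}, x_{t-2}
y = x[p:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimate of the AR coefficients
next_val = coef @ x[-1:-p - 1:-1]              # one-step-ahead prediction from the last p values
print(coef, next_val)
```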
A confusing piece of terminology: Masked (causal) Attention is considered the opposite of a Bidirectional LM because information about tokens after the current token is not used. The key distinction is unidirectionality vs. bidirectionality.
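The distinction in mask form, as a toy sketch (sequence length and naming are illustrative):

```python
import numpy as np

n = 5
bidirectional = np.ones((n, n), dtype=bool)    # every position attends to every position (bidirectional LM, BERT-style)
causal = np.tril(np.ones((n, n), dtype=bool))  # masked attention: position t only sees positions <= t (GPT-style)
print(causal.astype(int))
```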
Autoregressive Model Notion
Decoder-only LLM to Encoder-Decoder Transformer
Injective function to final hidden state
LLM
Representational Collapse (2024) - Transformers need glasses!

A Transformer compresses an input sequence of n tokens (each a d-dimensional embedding) into a single fixed d-dimensional residual stream at the final position.
The paper proves that different input sequences can become nearly indistinguishable in the representation space of the last token. From a GNN perspective on Transformers, information is compressed and lost as it flows through the autoregressive path bottlenecks (over-squashing): copying tokens from the beginning of the sequence is easy, but copying from the end is difficult. It also proves that softmax normalization removes absolute information about sequence length, making exact counting fundamentally impossible. Positional encoding and causal masks partially mitigate this, but the fundamental limitation remains.
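A toy single-head attention calculation illustrating the collapse argument (the random embeddings and uniform attention scores are simplifying assumptions, not the paper's exact construction): because softmax weights always sum to 1, appending one extra token shifts the last-token output by only ~1/n, so "1 1 … 1" and "1 1 … 1 0" become indistinguishable as n grows.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
emb = {0: rng.normal(size=d), 1: rng.normal(size=d)}  # hypothetical token embeddings

def last_token_attention(tokens):
    V = np.stack([emb[t] for t in tokens])   # values
    scores = np.zeros(len(tokens))           # identical scores -> uniform softmax weights
    w = np.exp(scores) / np.exp(scores).sum()
    return w @ V                             # attention output at the last position

for n in (10, 100, 1000, 10000):
    a = last_token_attention([1] * n)        # "1 ... 1"
    b = last_token_attention([1] * n + [0])  # "1 ... 1 0"
    print(n, np.linalg.norm(a - b))          # gap shrinks roughly like 1/n
```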
With low-precision floating-point arithmetic this phenomenon worsens rapidly, causing copying and counting tasks to fail. In Transformers, however, the real root cause of the information bottleneck is the dimensional bottleneck of the representation space, not float precision, which is unrelated to over-squashing. Even with infinite floating-point precision, the mismatch in degrees of freedom remains, so higher precision does not solve over-squashing; blaming precision ignores the dimensionality and normalization issues.
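A small float16 sketch of why low precision makes the collapse show up at much shorter lengths (here 1/n is a stand-in for one token's share of the softmax mass, assumed for illustration):

```python
import numpy as np

for n in (10, 1000, 10000):
    base = np.float16(1.0)
    delta = np.float16(1.0 / n)       # rough size of one extra token's contribution
    print(n, base + delta == base)    # becomes True once 1/n falls below float16 resolution near 1.0
```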
When the input sequence's graph structure is given by a causal mask and we draw the DAG defining the information flow between tokens, information sinks into the last token. Analyzing the information flow of such feedforward DAGs from a graph-theoretic perspective, the paper argues that there exist better graph structures than the fully-connected one.
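A quick path-counting sketch of that DAG view (sequence length and layer count are arbitrary): with a causal mask, position j can aggregate from every position i ≤ j in each layer, so all paths terminate at the final token and earlier positions accumulate far more of them.

```python
import numpy as np

n, layers = 6, 3
F = np.triu(np.ones((n, n), dtype=np.int64))          # F[i, j] = 1: information can flow i -> j within one layer
paths_to_last = np.linalg.matrix_power(F, layers)[:, -1]
print(paths_to_last)                                  # earlier positions have many more paths into the final token
```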

Defining two criteria, mixing time and fidelity, the paper shows that the FS-graph found by FunSearch is sparser than the fully-connected graph while still transmitting information better.
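A hedged sketch of one way to proxy mixing time: the spectral gap of the random-walk matrix on a candidate attention graph (a larger gap roughly means faster mixing). The ring-plus-skip graph below is a hypothetical stand-in, not the actual FS-graph found by FunSearch.

```python
import numpy as np

def spectral_gap(adj):
    P = adj / adj.sum(axis=1, keepdims=True)            # row-normalized random-walk matrix
    eig = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - eig[1]                                  # gap between the two largest eigenvalue magnitudes

n = 16
full = np.ones((n, n))                                   # fully-connected attention graph (with self-loops)
sparse = np.eye(n)
for i in range(n):
    for j in (i - 1, i + 1, i + n // 2):                 # ring edges plus one long-range skip per node
        sparse[i, j % n] = 1.0

print("edges:", int(full.sum()), "vs", int(sparse.sum()))
print("fully-connected gap:", round(spectral_gap(full), 3))
print("sparse graph gap:", round(spectral_gap(sparse), 3))
```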

Seonglae Cho