Bottlenecks on graphs via curvature
Over-squashing is a topological bottleneck caused by low-curvature (negatively curved) regions of the graph; it is a form of geometric distortion. As information from a distant node propagates toward a target node, the amount that can be transmitted decays exponentially with distance, at a rate governed by the graph's curvature. Formally, the bottleneck shows up in the Jacobian of a node's representation with respect to a distant input: its magnitude is bounded by powers of the normalized adjacency matrix and shrinks rapidly through negatively curved regions.
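A minimal sketch of this sensitivity bound, assuming a simple path graph and a GCN-style normalized adjacency (all sizes here are illustrative): the influence a distant input can have on a node is controlled by the corresponding entry of the r-th power of the normalized adjacency, which decays exponentially with the distance r.

```python
# Sketch: on a path graph, the sensitivity of node 0 to the input of the node
# at distance r is bounded by (A_norm^r)[0, r], which decays exponentially.
import numpy as np

n = 16                                      # nodes on a path graph 0 - 1 - ... - 15
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

A_hat = A + np.eye(n)                       # add self-loops
deg = A_hat.sum(axis=1)
A_norm = A_hat / np.sqrt(np.outer(deg, deg))  # D^{-1/2} (A + I) D^{-1/2}

Ar = np.eye(n)
for r in range(1, n):
    Ar = Ar @ A_norm                        # r-step influence (A_norm)^r
    # entry (0, r): bound on how much the input of node r can affect node 0
    print(f"r={r:2d}  sensitivity bound (A^r)[0,{r}] = {Ar[0, r]:.2e}")
```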
Vanishing gradients, over-smoothing, and over-squashing all stem from the same cause, as demonstrated theoretically and experimentally in this research. GNNs suffer from much more severe gradient vanishing than RNNs because of the spectral contractive properties of the normalized adjacency matrix. This gradient vanishing is the root cause of over-smoothing, leading to a "fixed-point convergence" phenomenon in which all node representations converge to zero.
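A minimal sketch of this contraction, assuming linear message passing with a random-walk normalized adjacency on a small synthetic graph (no trained weights): each layer damps every non-dominant spectral component, so node features collapse onto a single shared vector; stacking contractive weight matrices on top drives that shared vector toward zero as well.

```python
# Sketch: repeated multiplication by the normalized adjacency makes node
# features indistinguishable (over-smoothing); the "spread" across nodes
# contracts toward zero with depth.
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8

A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0   # ring keeps the graph connected
chords = np.triu(rng.random((n, n)) < 0.2, 1)     # a few random extra edges
A = np.maximum(A, (chords | chords.T).astype(float))

A_hat = A + np.eye(n)
P = A_hat / A_hat.sum(axis=1, keepdims=True)      # D^{-1}(A + I), row-stochastic

H = rng.standard_normal((n, d))
for layer in range(1, 33):
    H = P @ H                                     # one linear message-passing layer
    if layer % 8 == 0:
        spread = np.linalg.norm(H - H.mean(axis=0), ord="fro")
        print(f"layer {layer:2d}: node-feature spread = {spread:.2e}")
```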
LLM
Representational Collapse (2024) - Transformers need glasses!

A Transformer compresses an input sequence of length n (each token a d-dimensional embedding) into a single fixed d-dimensional residual stream at the last token.
The paper proves that different input sequences become nearly identical in the representation space of the last token. From a GNN perspective on Transformers, information from early tokens is compressed and lost through the autoregressive model's path bottlenecks, i.e., over-squashing; as a result, copying tokens from the beginning of the sequence is easy, while copying tokens from the end is difficult. The paper also proves that softmax normalization removes absolute information about sequence length, making accurate counting fundamentally impossible. Positional encodings and causal masks partially mitigate this, but the fundamental limitation remains.
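A minimal sketch of the counting argument, using a toy single-head attention readout with hypothetical dimensions: because softmax weights always sum to 1, attending over n identical tokens yields exactly the same output for every n, so the readout alone carries no information about how many tokens were seen.

```python
# Sketch: softmax normalization erases absolute sequence-length information.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

d = 4
rng = np.random.default_rng(0)
q = rng.standard_normal(d)            # query of the final token
token = rng.standard_normal(d)        # one repeated token embedding

for n in (4, 64, 1024):
    K = np.tile(token, (n, 1))        # n identical keys / values
    attn = softmax(K @ q)             # weights always sum to 1, whatever n is
    out = attn @ K                    # attention readout (values = keys here)
    print(f"n={n:5d}  readout = {np.round(out, 4)}")   # identical for every n
```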
Under low-precision floating-point arithmetic this phenomenon worsens rapidly, causing copying and counting tasks to fail. However, the actual root cause of the information bottleneck in Transformers is the dimensional bottleneck of the representation space, not float precision, which by itself is unrelated to over-squashing. Even with infinite floating-point precision, the mismatch in degrees of freedom between n tokens and a single d-dimensional vector remains, so more precision does not solve over-squashing; blaming precision alone ignores the dimensionality and normalization issues.
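A minimal sketch of both points, using uniform attention (a stand-in for softmax over equal scores) as the last-token readout: two sequences that differ only in their first token have readouts separated by O(1/n) at any precision, and merely storing that readout in float16 rounds the gap to exactly zero once n is large enough.

```python
# Sketch: the 1/n squeeze exists at any precision; low precision only makes the
# already-tiny gap collapse to exactly zero.
import numpy as np

for n in (256, 2048, 8192):
    seq_a = np.ones(n)                     # all-ones sequence
    seq_b = np.ones(n); seq_b[0] = 0.0     # differs only in the first token
    mean_a, mean_b = seq_a.mean(), seq_b.mean()   # exact uniform-attention readout
    for dtype in (np.float64, np.float16):
        gap = abs(float(dtype(mean_a)) - float(dtype(mean_b)))
        print(f"n={n:5d} {np.dtype(dtype).name:8s} |rep_a - rep_b| = {gap:.2e}")
```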
When the graph structure of the input sequence is the causal mask and we draw the DAG defining information flow between input tokens, all information sinks into the last token. Analyzing the information flow of such feedforward DAGs from a graph-theoretic perspective, the paper argues that there exist better graph structures than fully-connected attention.
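A minimal sketch of that view: treating the causal mask as the adjacency of a layered DAG and counting directed paths shows that every input token has paths into the final position while nothing flows the other way, so the last token is the sink through which all information must be squeezed (the sizes here are illustrative).

```python
# Sketch: token j at layer l+1 receives edges from every token i <= j at layer l.
# Path counts over L layers show the last token aggregates all inputs.
import numpy as np

n, L = 8, 3
A = np.tril(np.ones((n, n), dtype=np.int64))   # causal mask: j attends to i <= j

paths = np.linalg.matrix_power(A, L)           # paths[j, i] = #paths from input i to output j
print("paths from each input token into the last token:", paths[n - 1])
print("paths from the last input token to earlier outputs:", paths[:-1, n - 1])
```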

Defining two criteria, mixing time and fidelity, the paper shows that the FS-graph found by FunSearch is sparser than the fully-connected graph while still transmitting information better.
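A minimal sketch of how one side of this trade-off can be measured; this is not the paper's FS-graph, only a generic sparse stand-in (a ring plus random chords) compared with the fully-connected graph by edge count and by the spectral gap of a lazy random walk, whose inverse controls the mixing time (fidelity is not measured here).

```python
# Sketch: sparsity (edge count) vs mixing speed (spectral gap of a lazy random walk).
import numpy as np

def spectral_gap(A):
    d = A.sum(axis=1)
    P = 0.5 * (np.eye(len(A)) + A / d[:, None])   # lazy random-walk matrix
    eig = np.sort(np.linalg.eigvals(P).real)[::-1]
    return 1.0 - eig[1]                            # gap 1 - lambda_2

rng = np.random.default_rng(0)
n = 64

full = np.ones((n, n)) - np.eye(n)                 # fully-connected graph
sparse = np.zeros((n, n))
for i in range(n):                                 # ring plus a few random chords per node
    sparse[i, (i + 1) % n] = sparse[(i + 1) % n, i] = 1.0
    for j in rng.choice(n, size=3, replace=False):
        if j != i:
            sparse[i, j] = sparse[j, i] = 1.0

for name, A in (("fully-connected", full), ("sparse stand-in", sparse)):
    print(f"{name:16s} edges={int(A.sum() / 2):5d}  spectral gap={spectral_gap(A):.3f}")
```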

Seonglae Cho