Transformers continuously accumulate a KV cache → past information is preserved as-is, to excess. From an Information Bottleneck perspective, this keeps the input information I(X;Z) too large, which hinders generalization (reasoning ability). RNNs, in contrast, recompress their state at every step, which makes them stronger at rule-based reasoning.
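For reference, the trade-off invoked here is the standard Information Bottleneck objective (general IB formulation, not notation specific to this paper): compress the representation Z of input X while keeping what predicts the target Y.

```latex
\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta\, I(Z;Y)
```

An ever-growing KV cache keeps I(X;Z) near its maximum, so the compression term is effectively never minimized.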
Transformers can also mimic the human brain: consolidation (stabilizing new memories) and reconsolidation (retrieving and modifying past memories) can improve reasoning ability. The idea is to insert a module that directly rewrites the KV cache:
Cache Processor role: selectively reprocesses KV at the end of each recently generated reasoning step. The Processor activates when the model ends a reasoning step with a newline (\n) and performs an in-place rewrite on two parts:
- the entire KV of the recent step (consolidation)
- the past top-k KVs that the recent step strongly attended to (reconsolidation)
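The selection logic above can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the function name, the numpy arrays standing in for cached keys/values, and the generic `rewrite` callable (in the paper, a learned processor module) are all assumptions.

```python
import numpy as np

def process_cache(K, V, attn, step_start, top_k, rewrite):
    """In-place KV rewrite at the end of a reasoning step (toy sketch).

    K, V:       (seq_len, d) cached keys / values
    attn:       (seq_len,) attention mass the recent step placed on each position
    step_start: index where the just-finished step begins
    top_k:      number of strongly-attended past positions to reconsolidate
    rewrite:    function (k, v) -> (k', v'), stand-in for the learned processor
    """
    seq_len = K.shape[0]
    # Consolidation: every position of the just-finished step
    recent = np.arange(step_start, seq_len)
    # Reconsolidation: top-k past positions by attention from the recent step
    if step_start > 0:
        past = np.argsort(attn[:step_start])[-top_k:]
    else:
        past = np.array([], dtype=int)
    for i in np.concatenate([past, recent]):
        K[i], V[i] = rewrite(K[i], V[i])
    return K, V
```

All other cache entries are left untouched, so the rewrite cost scales with step length plus top_k rather than with the full sequence.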
The Bottlenecked Transformer rewrites the KV cache like memory, pruning unnecessary information and reinforcing only prediction-critical information. By performing this latent-space 'cleanup' at each reasoning step, it significantly improves mathematical and reasoning performance.
https://arxiv.org/pdf/2505.16950

Seonglae Cho