Transformers continuously accumulate a KV cache → past information is preserved as-is, to excess. From an Information Bottleneck perspective, this keeps the input information I(X;Z) too large, which hinders generalization (reasoning ability). RNNs, in contrast, recompress their state at every step, which makes them stronger at rule-based reasoning.
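For reference, the trade-off invoked here is the standard Information Bottleneck objective (general IB formulation, not notation specific to this paper): compress the representation Z of input X while keeping what predicts the target Y.

```latex
\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta\, I(Z;Y)
```

An ever-growing KV cache keeps I(X;Z) near its maximum, so the compression term is effectively never minimized.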
Transformers can also mimic the human brain: consolidation (stabilizing new memories) and reconsolidation (retrieving and modifying past memories) can improve reasoning ability. The idea is to insert a module that directly rewrites the KV cache:
Cache Processor role: selectively reprocesses KV at the end of each recently generated reasoning step. The Processor activates when the model ends a reasoning step with a newline (\n) and performs an in-place rewrite on two parts:
- the entire KV of the recent step (consolidation)
- the past top-k KVs that the recent step strongly attended to (reconsolidation)
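The selection logic above can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the function name, the numpy arrays standing in for cached keys/values, and the generic `rewrite` callable (in the paper, a learned processor module) are all assumptions.

```python
import numpy as np

def process_cache(K, V, attn, step_start, top_k, rewrite):
    """In-place KV rewrite at the end of a reasoning step (toy sketch).

    K, V:       (seq_len, d) cached keys / values
    attn:       (seq_len,) attention mass the recent step placed on each position
    step_start: index where the just-finished step begins
    top_k:      number of strongly-attended past positions to reconsolidate
    rewrite:    function (k, v) -> (k', v'), stand-in for the learned processor
    """
    seq_len = K.shape[0]
    # Consolidation: every position of the just-finished step
    recent = np.arange(step_start, seq_len)
    # Reconsolidation: top-k past positions by attention from the recent step
    if step_start > 0:
        past = np.argsort(attn[:step_start])[-top_k:]
    else:
        past = np.array([], dtype=int)
    for i in np.concatenate([past, recent]):
        K[i], V[i] = rewrite(K[i], V[i])
    return K, V
```

All other cache entries are left untouched, so the rewrite cost scales with step length plus top_k rather than with the full sequence.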
The Bottlenecked Transformer rewrites the KV cache like memory, pruning unnecessary information and reinforcing only prediction-critical information. By performing this latent-space 'cleanup' at each reasoning step, it significantly improves mathematical and reasoning performance.
https://arxiv.org/pdf/2505.16950

Seonglae Cho