Bottlenecked Transformer

Creator: Seonglae Cho
Created: 2025 Nov 27 22:53
Edited: 2025 Nov 27 22:59
Transformers continuously accumulate KV cache → past information is preserved verbatim, and in excess. From an Information Bottleneck perspective this makes the input information I(X;Z) too large, which hinders generalization (reasoning ability). RNNs, in contrast, recompress information at each step, which makes them stronger at rule-based reasoning.
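For context, a standard formulation of the Information Bottleneck objective (not spelled out in this note) seeks a representation Z of the input X that stays predictive of the target Y while compressing X:

$$\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta\, I(Z;Y)$$

Reading Z as the KV cache: appending every token's KV unchanged keeps I(X;Z) near its maximum, so the cache retains input detail that is irrelevant to Y, which is the failure mode the processor below is meant to address.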
Transformers can also mimic the human brain: consolidation (stabilizing new memories) and reconsolidation (retrieving and modifying past memories) can both improve reasoning ability. The idea is to insert a module that directly rewrites the KV cache:
Cache Processor role: selectively reprocesses KV at the end of each newly generated reasoning step. The processor activates when the model ends a reasoning step with a newline (\n) and performs an in-place rewrite of two parts (see the sketch after the list):
  • Entire KV of the recent step (consolidation)
  • Past top-k KVs that the recent step strongly attended to (reconsolidation)
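A minimal sketch of this per-step rewrite, assuming a single attention head; the names (CacheProcessor, step_start, attn_to_past, k_reconsolidate) are illustrative, not the paper's API:

```python
# Hypothetical sketch: rewrite (1) the KVs of the just-finished reasoning step
# (consolidation) and (2) the top-k past positions it attended to (reconsolidation).
import torch
import torch.nn as nn


class CacheProcessor(nn.Module):
    def __init__(self, d_head: int, k_reconsolidate: int = 8):
        super().__init__()
        # Small rewrite network applied in place to selected KV slots.
        self.rewrite = nn.Sequential(
            nn.Linear(2 * d_head, 4 * d_head), nn.GELU(), nn.Linear(4 * d_head, 2 * d_head)
        )
        self.k = k_reconsolidate

    def forward(self, keys, values, step_start, attn_to_past):
        # keys, values: (seq_len, d_head) cache for one head
        # step_start: index where the just-finished reasoning step begins
        # attn_to_past: (step_len, step_start) attention from the step to earlier positions
        kv = torch.cat([keys, values], dim=-1)              # (seq_len, 2*d_head)

        # Consolidation: rewrite the entire KV of the recent step in place.
        recent = slice(step_start, kv.shape[0])
        kv[recent] = kv[recent] + self.rewrite(kv[recent])  # residual in-place update

        # Reconsolidation: pick the past positions the step attended to most strongly.
        if step_start > 0 and self.k > 0:
            scores = attn_to_past.sum(dim=0)                # (step_start,)
            top = torch.topk(scores, k=min(self.k, step_start)).indices
            kv[top] = kv[top] + self.rewrite(kv[top])

        keys_new, values_new = kv.chunk(2, dim=-1)
        return keys_new, values_new


# Usage: call the processor whenever the model emits "\n" at the end of a step.
if __name__ == "__main__":
    d = 64
    proc = CacheProcessor(d_head=d)
    K, V = torch.randn(32, d), torch.randn(32, d)
    attn = torch.softmax(torch.randn(8, 24), dim=-1)  # 8-token step attending to 24 past slots
    K, V = proc(K, V, step_start=24, attn_to_past=attn)
    print(K.shape, V.shape)
```

In a real implementation the processor would run per layer and per head and be trained end-to-end with the base model; the residual form here simply keeps the rewrite in the same representation space as the original KV.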
The Bottlenecked Transformer treats the KV cache like memory, rewriting it to discard unnecessary information and strengthen only prediction-critical information. This latent-space 'cleanup' at each reasoning step significantly improves mathematical and reasoning performance.
 
 
arxiv.org