Infini-Transformer

Similar to a model-internal RAG that implements a working memory.

A 1-million-token context window has been tested, but the mechanism itself has no inherent upper limit.
  • Utilizes standard local attention mechanisms found in transformers.
  • Adds a global, long-range memory by compressing past segments' key-value states into a fixed-size memory matrix.
  • Merges both local and global attention to manage extended contexts efficiently.
The LLM performs standard local self-attention on the current segment while, in parallel, using the same queries Q to attend over the KV pairs stored in the compressive memory matrix M; the two results are then mixed with a learned gating scalar. This gives both local attention and global memory retrieval through the same attention machinery. The memory is updated by adding each new segment's key-value bindings to M. Because the memory reuses the existing Q, K, and V projections, it adds essentially no new weights beyond the gating scalar. The reason this works is attributed to the linear representation hypothesis: if the relevant information is represented linearly, memory can be retained simply through addition.
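A minimal single-head sketch in NumPy of the mechanism described above: local softmax attention over the current segment, retrieval from the compressive memory M with the same queries, a sigmoid-gated mix of the two, and a purely additive memory update. The ELU+1 nonlinearity, the normalization vector z, and the simple additive (non-delta) update follow the Infini-attention formulation; the function names, shapes, and the epsilon-initialized z are assumptions of this illustration, not a reference implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def elu_plus_one(x):
    # ELU(x) + 1: keeps memory "queries" and "keys" non-negative (linear-attention trick).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def infini_attention_step(Q, K, V, M, z, beta):
    """One segment of simplified, single-head Infini-attention.

    Q, K, V : (seg_len, d) query/key/value projections of the current segment
    M       : (d, d) compressive memory accumulated over earlier segments
    z       : (d,)   normalization vector accumulated alongside M
    beta    : gating scalar mixing memory retrieval with local attention
    """
    d = Q.shape[-1]

    # Local attention: ordinary softmax dot-product attention within the segment
    # (causal masking omitted for brevity).
    A_local = softmax(Q @ K.T / np.sqrt(d)) @ V

    # Global memory retrieval: query the compressive memory with sigma(Q).
    sq = elu_plus_one(Q)
    A_mem = (sq @ M) / (sq @ z)[:, None]

    # Gated mix: a single learned scalar decides how much long-term memory to use.
    g = 1.0 / (1.0 + np.exp(-beta))          # sigmoid(beta)
    A = g * A_mem + (1.0 - g) * A_local

    # Memory update: add this segment's key-value bindings to M. Retention is
    # literally just addition, which is the point made above.
    sk = elu_plus_one(K)
    M_new = M + sk.T @ V
    z_new = z + sk.sum(axis=0)

    return A, M_new, z_new

# Toy usage: stream two segments through the same persistent memory.
d, seg = 64, 128
rng = np.random.default_rng(0)
M, z = np.zeros((d, d)), np.full(d, 1e-6)    # small epsilon avoids divide-by-zero early on
for _ in range(2):
    Q, K, V = (rng.standard_normal((seg, d)) for _ in range(3))
    A, M, z = infini_attention_step(Q, K, V, M, z, beta=0.0)
```

In use, M and z persist across all segments of a long input, so the cost of the "global" path stays constant regardless of how far back the context reaches; the gate beta lets each head learn how much to rely on the compressed past versus the local window.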

Recommendations