Infini-Transformer

Similar to a model-internal RAG that implements a working memory.

A 1-million-token context window has been tested, but the mechanism itself has no inherent upper limit.
  • Utilizes standard local attention mechanisms found in transformers.
  • Adds a global, long-range memory by compressing past segments' key-value states into a fixed-size memory matrix.
  • Merges both local and global attention to manage extended contexts efficiently.
The LLM performs standard local self-attention on the current segment while, in parallel, using the same queries Q to attend over the KV pairs stored in the compressive memory matrix M; the two results are then mixed with a learned gating scalar. This gives both local attention and global memory retrieval through the same attention machinery. The memory is updated by adding each new segment's key-value bindings to M. Because the memory reuses the existing Q, K, and V projections, it adds essentially no new weights beyond the gating scalar. The reason this works is attributed to the linear representation hypothesis: if the relevant information is represented linearly, memory can be retained simply through addition.
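A minimal single-head sketch in NumPy of the mechanism described above: local softmax attention over the current segment, retrieval from the compressive memory M with the same queries, a sigmoid-gated mix of the two, and a purely additive memory update. The ELU+1 nonlinearity, the normalization vector z, and the simple additive (non-delta) update follow the Infini-attention formulation; the function names, shapes, and the epsilon-initialized z are assumptions of this illustration, not a reference implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def elu_plus_one(x):
    # ELU(x) + 1: keeps memory "queries" and "keys" non-negative (linear-attention trick).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def infini_attention_step(Q, K, V, M, z, beta):
    """One segment of simplified, single-head Infini-attention.

    Q, K, V : (seg_len, d) query/key/value projections of the current segment
    M       : (d, d) compressive memory accumulated over earlier segments
    z       : (d,)   normalization vector accumulated alongside M
    beta    : gating scalar mixing memory retrieval with local attention
    """
    d = Q.shape[-1]

    # Local attention: ordinary softmax dot-product attention within the segment
    # (causal masking omitted for brevity).
    A_local = softmax(Q @ K.T / np.sqrt(d)) @ V

    # Global memory retrieval: query the compressive memory with sigma(Q).
    sq = elu_plus_one(Q)
    A_mem = (sq @ M) / (sq @ z)[:, None]

    # Gated mix: a single learned scalar decides how much long-term memory to use.
    g = 1.0 / (1.0 + np.exp(-beta))          # sigmoid(beta)
    A = g * A_mem + (1.0 - g) * A_local

    # Memory update: add this segment's key-value bindings to M. Retention is
    # literally just addition, which is the point made above.
    sk = elu_plus_one(K)
    M_new = M + sk.T @ V
    z_new = z + sk.sum(axis=0)

    return A, M_new, z_new

# Toy usage: stream two segments through the same persistent memory.
d, seg = 64, 128
rng = np.random.default_rng(0)
M, z = np.zeros((d, d)), np.full(d, 1e-6)    # small epsilon avoids divide-by-zero early on
for _ in range(2):
    Q, K, V = (rng.standard_normal((seg, d)) for _ in range(3))
    A, M, z = infini_attention_step(Q, K, V, M, z, beta=0.0)
```

In use, M and z persist across all segments of a long input, so the cost of the "global" path stays constant regardless of how far back the context reaches; the gate beta lets each head learn how much to rely on the compressed past versus the local window.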

Recommendations