Ring Attention

Created: 2024 Mar 1 16:0
Creator: Seonglae Cho
Edited: 2024 Mar 31 10:25
To avoid recomputing attention over tokens that have already been processed, transformers store a KV cache, which becomes very large at long context lengths. Because the KV cache exceeds the memory of a single accelerator, it has to be sharded across many devices, and the resulting inter-accelerator communication makes data movement a bottleneck. Ring Attention addresses this by arranging the accelerators in a ring topology and performing the attention computation itself blockwise: each device keeps its own query block and a non-overlapping key-value block, and continuously passes its key-value block to the next device in the ring while this communication is overlapped with the blockwise attention computation.
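The following is a minimal single-process sketch of the idea under stated assumptions, not the paper's implementation: the sequence is split across hypothetical "devices", each device keeps its query block, the key-value blocks rotate around the ring (simulated here by rotating a Python list), and a streaming softmax accumulates each block's contribution. Causal masking and the actual overlap of communication with computation are omitted; the function name `ring_attention` and its parameters are illustrative.

```python
# Minimal NumPy sketch of ring attention (non-causal, single process).
import numpy as np

def ring_attention(q, k, v, num_devices):
    """q, k, v: [seq_len, d]. Simulates num_devices hosts arranged in a ring."""
    seq_len, d = q.shape
    assert seq_len % num_devices == 0
    block = seq_len // num_devices
    scale = 1.0 / np.sqrt(d)

    # Shard the sequence dimension: device i initially holds block i of q, k, v.
    q_blocks = [q[i * block:(i + 1) * block] for i in range(num_devices)]
    kv_blocks = [(k[i * block:(i + 1) * block], v[i * block:(i + 1) * block])
                 for i in range(num_devices)]

    # Per-device online-softmax state: running max m, denominator l, numerator acc.
    m = [np.full((block, 1), -np.inf) for _ in range(num_devices)]
    l = [np.zeros((block, 1)) for _ in range(num_devices)]
    acc = [np.zeros((block, d)) for _ in range(num_devices)]

    for _ in range(num_devices):          # one ring step per device count
        for i in range(num_devices):      # each device processes its current KV block
            k_blk, v_blk = kv_blocks[i]
            s = q_blocks[i] @ k_blk.T * scale                  # [block, block] scores
            m_new = np.maximum(m[i], s.max(axis=-1, keepdims=True))
            p = np.exp(s - m_new)
            correction = np.exp(m[i] - m_new)                  # rescale previous state
            acc[i] = acc[i] * correction + p @ v_blk
            l[i] = l[i] * correction + p.sum(axis=-1, keepdims=True)
            m[i] = m_new
        # "Send" each KV block to the next device in the ring (here: a list rotation).
        kv_blocks = kv_blocks[-1:] + kv_blocks[:-1]

    return np.concatenate([acc[i] / l[i] for i in range(num_devices)])

# Sanity check against standard full attention.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
s = q @ k.T / np.sqrt(8)
ref = np.exp(s - s.max(-1, keepdims=True))
ref = (ref / ref.sum(-1, keepdims=True)) @ v
assert np.allclose(ring_attention(q, k, v, num_devices=4), ref)
```

The running max/denominator update is the standard online-softmax trick used in blockwise attention; it is what lets a device consume key-value blocks one at a time without ever materializing the full attention matrix.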
This distributes the computation along the sequence dimension, theoretically supporting context windows of over 1M tokens. The paper reports that retrieval performance does not degrade even when the context length is extended beyond 512K, which suggests that a transformer trained in the standard way on sufficiently long data can itself capture relationships across very long sequences.
https://arxiv.org/pdf/2402.08268.pdf