To avoid recomputing attention over tokens that have already been processed, transformers keep a KV cache, which becomes very large at long context lengths. Because the KV cache can exceed the memory of a single accelerator, it has to be sharded across many devices, but the resulting inter-accelerator communication turns data movement into a bottleneck. The paper addresses this by arranging the accelerators in a ring topology and performing the attention computation itself blockwise: each device computes attention on the key-value block it currently holds while simultaneously passing that block on to the next device in the ring, so the communication is overlapped with computation.
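The sketch below is a minimal single-process simulation of this idea, not the paper's actual distributed implementation: each "device" holds one query block and one key-value block, KV blocks rotate around the ring, and a numerically stable online softmax accumulates the result so that no device ever materializes the full KV cache or the full attention matrix. Real Ring Attention additionally overlaps the block transfer with the block computation and applies causal masking, both of which are omitted here for clarity.

```python
import numpy as np

def ring_attention_simulated(q_blocks, k_blocks, v_blocks):
    """Simulate blockwise ring attention on one host.

    q_blocks, k_blocks, v_blocks: lists of arrays, one block per "device",
    each of shape (block_len, d). Returns the attention output per device,
    identical (up to floating point) to full softmax attention.
    """
    n_dev = len(q_blocks)
    d = q_blocks[0].shape[-1]
    # Per-device running state: un-normalized output, softmax denominator,
    # and running max for numerical stability (online softmax).
    out = [np.zeros_like(q) for q in q_blocks]
    denom = [np.zeros(q.shape[0]) for q in q_blocks]
    run_max = [np.full(q.shape[0], -np.inf) for q in q_blocks]

    k_cur, v_cur = list(k_blocks), list(v_blocks)
    for _ in range(n_dev):  # after n_dev ring steps, every KV block has visited every device
        for i in range(n_dev):
            scores = q_blocks[i] @ k_cur[i].T / np.sqrt(d)
            new_max = np.maximum(run_max[i], scores.max(axis=-1))
            scale = np.exp(run_max[i] - new_max)            # rescale previous accumulators
            p = np.exp(scores - new_max[:, None])
            out[i] = out[i] * scale[:, None] + p @ v_cur[i]
            denom[i] = denom[i] * scale + p.sum(axis=-1)
            run_max[i] = new_max
        # "Send" each KV block to the next device in the ring.
        k_cur = k_cur[-1:] + k_cur[:-1]
        v_cur = v_cur[-1:] + v_cur[:-1]
    return [o / dnm[:, None] for o, dnm in zip(out, denom)]
```

Because the per-block statistics are merged with the standard streaming-softmax rescaling, the result can be checked against ordinary attention computed over the concatenated sequence; the memory held by each simulated device stays constant in the number of devices, which is what lets the sequence dimension be distributed.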
This distributes the computation along the sequence dimension, theoretically supporting context windows of over 1M tokens. The paper reports that retrieval performance does not degrade even when the context length is extended beyond 512K. This suggests that, given sufficiently long training data and standard training, the transformer itself is capable of capturing relationships across very long sequences.