To avoid recomputing attention over already-processed tokens, the model stores a KV Cache, which grows very large at long context lengths. Because the KV Cache exceeds the memory of a single device, it has to be sharded across many nodes, and the resulting inter-accelerator traffic makes data movement the bottleneck. The paper addresses this by arranging the accelerators in a ring topology and computing the attention operation itself blockwise: each device repeatedly passes its key-value block to the next device in the ring while computing on the block it currently holds, so the communication is overlapped with computation and adds no extra overhead. A minimal sketch of the idea is given below.
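The following is a single-host simulation of this blockwise ring scheme, not the paper's actual distributed implementation: plain NumPy stands in for the accelerators, the "send to the next device" step is a list rotation, and the function name `ring_attention_sim` plus the online-softmax bookkeeping are illustrative assumptions. It shows why no device ever needs the full attention matrix or the full KV Cache.

```python
import numpy as np

def ring_attention_sim(q_blocks, k_blocks, v_blocks):
    """Simulate Ring Attention on one host: 'device' i holds q_blocks[i]
    plus one KV block that rotates around the ring each step.
    Blockwise attention is accumulated with an online (streaming) softmax,
    so the result matches full softmax attention over the whole sequence."""
    n_dev = len(q_blocks)
    d = q_blocks[0].shape[-1]
    # Per-device running accumulators for the online softmax.
    num = [np.zeros_like(q) for q in q_blocks]                    # weighted value sums
    den = [np.zeros((q.shape[0], 1)) for q in q_blocks]           # softmax denominators
    mx  = [np.full((q.shape[0], 1), -np.inf) for q in q_blocks]   # running row maxima

    kv = list(zip(k_blocks, v_blocks))   # KV block currently resident on each device
    for _ in range(n_dev):               # after n_dev rotations every Q block saw every KV block
        for i in range(n_dev):
            k, v = kv[i]
            s = q_blocks[i] @ k.T / np.sqrt(d)                    # local block of attention scores
            new_max = np.maximum(mx[i], s.max(-1, keepdims=True))
            scale = np.exp(mx[i] - new_max)                       # rescale previous accumulators
            p = np.exp(s - new_max)
            num[i] = num[i] * scale + p @ v
            den[i] = den[i] * scale + p.sum(-1, keepdims=True)
            mx[i] = new_max
        kv = kv[-1:] + kv[:-1]           # "send" each KV block to the next device in the ring
    return [n / d_ for n, d_ in zip(num, den)]

# Example: 4 "devices", each holding a 128-token block with head dim 64.
rng = np.random.default_rng(0)
qs, ks, vs = ([rng.standard_normal((128, 64)) for _ in range(4)] for _ in range(3))
outputs = ring_attention_sim(qs, ks, vs)   # one output block per device
```

In the real setting the list rotation becomes a peer-to-peer transfer that runs concurrently with the blockwise compute, which is why the sequence can be split across devices without communication becoming the bottleneck.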
This distributes the computation along the sequence dimension, in principle supporting context windows beyond 1M tokens. The paper reports that even when the context length grows past 512K, retrieval performance does not degrade. This suggests that, given sufficiently long training data and standard training, the transformer itself is capable of capturing relationships across very long sequences.
TensorFlow KR | "I wrote in this group a while ago that I was curious how Gemini solved this problem, and now Gemini 1.5 Pro reportedly even runs inference over contexts 100M long. That is a remarkable advance, and I'd like to write a post about it. If I had to pick one problem that deep learning has long tried to solve but struggled with, it would be Long..."
https://www.facebook.com/groups/TensorFlowKR/permalink/2246893962318316/?mibextid=oMANbw


Seonglae Cho