Rotary Positional Embedding
Positional interpolation between the start and end of the sentence (which causes the "lost in the middle" problem)
Relative positional encoding changes the attention-score calculation according to the relative distance between tokens. RoPE is the representative method, and its defining feature is that it indicates position through a vector rotation operation, where the angle rotated is determined by the token position and a set of frequencies chosen relative to the max context window size. So, wouldn't it be possible to process long data while keeping the information learned on short data: first train on short data, then increase the model's context window size and proportionally reduce the rotation speed while fine-tuning on long data?
RoPE helps the model encode relative distance by applying frequency-based rotations to the Q and K vectors. The higher-frequency rotations distinguish subtle order differences between tokens at short distances, while the lower-frequency rotations more gently reflect relationships between tokens at longer distances. Just as human time perception uses various cycles (seconds/minutes/hours/days/months/years) to comprehensively understand change, from momentary events to long-term flows, RoPE captures positional relationships between tokens at multiple scales through rotations of various frequencies.
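As a rough illustration of this multi-frequency rotation, here is a minimal NumPy sketch, not any specific library's implementation; the names `rope_rotate`, `head_dim`, and `base` are illustrative conventions, not APIs from the source:

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate a query/key vector x (shape [head_dim]) by position-dependent angles.

    Each dimension pair (2i, 2i+1) is rotated by angle pos * theta_i, where
    theta_i = base**(-2i/head_dim): high frequencies (small i) change quickly
    with position, low frequencies change slowly.
    """
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)          # per-pair rotation frequency
    angles = pos * theta                     # angle grows linearly with position
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin   # 2-D rotation of each pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

# Because the rotation is position-dependent, the dot product of a rotated
# query at position m with a rotated key at position n depends only on m - n.
q = rope_rotate(np.ones(8), pos=5)
k = rope_rotate(np.ones(8), pos=2)
print(q @ k)  # same value for any (m, n) with m - n == 3
```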
A model trained on lengths of 2k and 4k worked well, without a significant drop in perplexity, even when extended to 16k and 32k. Various position-interpolation methods exploiting this property of RoPE have been studied. The bright prospect was that, instead of fine-tuning the model, you could apply it with RAG to any desired service as long as there was enough data, relying on the transformer's in-context learning ability. Behind this was the belief that LLMs with RoPE have a natural ability to handle long texts even if they never encountered extremely long ones during training. In reality, RoPE effectively encodes positional information in transformer-based language models but fails to generalize past the sequence length it was trained on.
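A minimal sketch of the position-interpolation idea raised above, assuming linear scaling of positions by L_train / L_target so that every rotation angle stays within the trained range (the variable names and lengths here are illustrative):

```python
import numpy as np

def rope_angles(pos: float, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angles RoPE would apply at a given (possibly scaled) position."""
    i = np.arange(head_dim // 2)
    return pos * base ** (-2.0 * i / head_dim)

L_train, L_target = 4096, 16384
scale = L_train / L_target           # 0.25: rotate 4x slower

pos = 12000                          # a position the model never saw in training
plain = rope_angles(pos, head_dim=8)
interpolated = rope_angles(pos * scale, head_dim=8)  # mapped back inside [0, L_train)
print(plain[0], interpolated[0])     # 12000.0 vs. 3000.0 radians on the fastest pair
```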
Limitation of RoPE
When models are pretrained on relatively short contexts and then extended to longer contexts, applying the same rotation patterns to distances never seen during initial training can introduce inaccurate positional signals. At distances beyond the initially trained range, the frequency-based rotations repeat the same state at regular intervals, so different distances are incorrectly perceived as having the same positional relationship. This distorts the correlations between very distant tokens and hinders the model's ability to comprehensively understand long contexts. It is like completing several full rotations and returning to the starting position, which can be misread as not having moved at all.
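A small numeric illustration of this aliasing, using an arbitrary example frequency (the values are made up for demonstration): for one RoPE frequency, any two relative distances that differ by a full period produce exactly the same rotation state, so that dimension cannot tell them apart.

```python
import numpy as np

theta = 0.05                         # one low-frequency rotation rate
period = 2 * np.pi / theta           # ~125.7 positions per full rotation

d1 = 40.0
d2 = d1 + period                     # a much larger distance, same angle mod 2*pi
print(np.cos(d1 * theta), np.cos(d2 * theta))  # identical up to float error
print(np.sin(d1 * theta), np.sin(d2 * theta))  # identical up to float error
```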
RoPE Extensions
RoFormer (ZhuiyiTechnology/roformer on Hugging Face)
LongRoPE
[Figure: LongRoPE, visualized]