Yet another RoPE extensioN
The reparametrization of RoPE as a set of 2D matrices has a clear benefit for the implementation of this attention scaling: we can instead use a "length scaling" trick which scales both q_m and k_n by a constant factor √(1/t), simply by scaling the complex RoPE embeddings by the same amount.
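As a minimal sketch of this trick (the names `rope_cache`, `apply_rope`, and `attn_factor` are illustrative, not from the paper), the √(1/t) factor can be folded into the precomputed cos/sin tables, so q_m and k_n each pick up √(1/t) when rotated and the attention code itself stays untouched:

```python
import math
import torch

def rope_cache(seq_len, head_dim, base=10000.0, attn_factor=1.0):
    """Precompute RoPE cos/sin tables, pre-scaled by attn_factor = sqrt(1/t).

    Because the factor is baked into the cache, q_m and k_n are each scaled by
    sqrt(1/t) when rotated, so their dot product gains the full 1/t softmax
    temperature without any change to the attention kernel.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, head_dim/2)
    return torch.cos(angles) * attn_factor, torch.sin(angles) * attn_factor

def apply_rope(x, cos, sin):
    """Rotate feature pairs (first half, second half) by the cached angles."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Illustrative usage: s is the context-extension scale factor; the paper
# suggests roughly sqrt(1/t) = 0.1 * ln(s) + 1 as the scaling amount.
s = 16384 / 4096
attn_factor = 0.1 * math.log(s) + 1.0
cos, sin = rope_cache(seq_len=16384, head_dim=64, attn_factor=attn_factor)
q = torch.randn(1, 16384, 64)        # (batch, seq, head_dim)
q_rot = apply_rope(q, cos, sin)      # the same call is used for k
```

Since both queries and keys are rotated with the same pre-scaled cache, the q·k logits are scaled by 1/t overall, which is exactly the attention-temperature adjustment the trick is meant to implement.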
Results
- YaRN doesn't just perform well at the longer context lengths seen during fine-tuning; it can also extrapolate beyond the limited context of the fine-tuning dataset.
- YaRN combined with Dynamic Scaling at inference time (Dynamic-YaRN) allows for more than 2x context window extension without any fine-tuning.
- YaRN allows efficient extrapolation with fine-tuning on shorter datasets and can take advantage of transfer learning for faster convergence.