Yet another RoPE extensioN
YaRN also introduces an attention temperature t, dividing the attention logits by t before the softmax. The reparametrization of RoPE as a set of 2D matrices has a clear benefit for implementing this attention scaling: we can instead use a "length scaling" trick which scales both q_m and k_n by a constant factor √(1/t), simply by scaling the complex RoPE embeddings by the same amount.
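A minimal PyTorch sketch of this trick, assuming a standard interleaved RoPE layout; `rope_cache` and `apply_rope` are hypothetical helper names, not the paper's code. Pre-scaling the cached cos/sin tables by √(1/t) scales both q and k by √(1/t) once RoPE is applied, so the attention logits q·k end up scaled by 1/t without any change to the attention kernel.

```python
import torch

def rope_cache(seq_len, head_dim, base=10000.0, t=1.0, device="cpu"):
    # RoPE cos/sin tables pre-scaled by sqrt(1/t) ("length scaling" trick):
    # scaling the tables scales q and k by the same amount, so the attention
    # logits q·k are scaled by 1/t with no change to the attention kernel.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len, device=device).float(), inv_freq)
    scale = (1.0 / t) ** 0.5  # sqrt(1/t)
    return torch.cos(angles) * scale, torch.sin(angles) * scale

def apply_rope(x, cos, sin):
    # x: (..., seq_len, head_dim), rotated in interleaved pairs (x0, x1), (x2, x3), ...
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```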
Results
- YaRN not only extends the context window through fine-tuning, it also extrapolates beyond the limited context lengths seen in the fine-tuning data.
- Dynamic-YaRN, which combines YaRN with Dynamic Scaling at inference time, allows more than 2x context window extension without any fine-tuning (see the sketch after this list).
- YaRN allows efficient extrapolation with fine-tuning on shorter datasets and can take advantage of transfer learning for faster convergence.
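A rough sketch of the Dynamic Scaling idea referenced above (the helper `yarn_rope_cache` and the original context length 4096 are illustrative assumptions, not from the source): instead of fixing the scale factor s ahead of time, it is recomputed on every forward pass from the current sequence length, so interpolation only kicks in once the sequence actually exceeds the original context window.

```python
def dynamic_scale(current_len: int, original_ctx_len: int) -> float:
    # Dynamic Scaling: s = max(1, current_len / original_ctx_len), recomputed
    # each forward pass, so sequences within the original window are untouched.
    return max(1.0, current_len / original_ctx_len)

# Hypothetical usage: rebuild the interpolated RoPE cache whenever s changes.
# s = dynamic_scale(current_len=len(input_ids), original_ctx_len=4096)
# cos, sin = yarn_rope_cache(seq_len=len(input_ids), scale=s, ...)
```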
YaRN: Efficient Context Window Extension of Large Language Models
Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence...
https://arxiv.org/abs/2309.00071

Paper page - YaRN: Efficient Context Window Extension of Large Language Models
https://huggingface.co/papers/2309.00071
NousResearch/Yarn-Llama-2-13b-128k · Hugging Face
https://huggingface.co/NousResearch/Yarn-Llama-2-13b-128k
Understanding YaRN: Extending Context Window of LLMs
YaRN: Yet another RoPE extensioN method
https://medium.com/@rcrajatchawla/understanding-yarn-extending-context-window-of-llms-3f21e3522465


Seonglae Cho