YaRN

Creator: Seonglae Cho
Created: 2023 Sep 16 8:45
Edited: 2024 Jul 8 15:02

Yet another RoPE extensioN

The reparametrization of RoPE as a set of 2D rotation matrices has a clear benefit for the implementation of this attention scaling: we can instead use a "length scaling" trick which scales both $q_m$ and $k_n$ by a constant factor $\sqrt{1/t}$ by simply scaling the complex RoPE embeddings by the same amount. Here $t$ is a temperature applied to the attention logits, i.e. $\mathrm{softmax}(q_m^\top k_n / (t\sqrt{d}))$.
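
A minimal PyTorch sketch of this trick, folding $\sqrt{1/t}$ into the cached cos/sin embeddings. The helper names here are illustrative rather than from any specific implementation; the fit $\sqrt{1/t} \approx 0.1 \ln s + 1$ (with $s$ the context-extension factor) is the empirical recommendation reported in the YaRN paper.

```python
import math
import torch

def yarn_mscale(s: float) -> float:
    # Empirical fit from the YaRN paper: sqrt(1/t) ~= 0.1 * ln(s) + 1,
    # where s >= 1 is the context-extension scale factor.
    return (0.1 * math.log(s) + 1.0) if s > 1 else 1.0

def rope_cos_sin(dim: int, seq_len: int, base: float = 10000.0,
                 mscale: float = 1.0):
    # Standard RoPE angles m * theta_i, one angle per 2D rotation pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    pos = torch.arange(seq_len, dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)          # (seq_len, dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)     # (seq_len, dim)
    # Length-scaling trick: bake sqrt(1/t) into the cached embeddings so
    # q_m and k_n are both scaled without touching the attention kernel.
    return emb.cos() * mscale, emb.sin() * mscale

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # With the scaled cos/sin this returns sqrt(1/t) * RoPE(x), so the
    # q.k dot product inside attention picks up the factor 1/t overall.
    return x * cos + rotate_half(x) * sin

# Example: extending a model by s = 16 (e.g. 4k -> 64k tokens).
cos, sin = rope_cos_sin(dim=128, seq_len=65536, mscale=yarn_mscale(16.0))
```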
 
 

Results

  • YaRN is not just effective at handling the longer contexts seen during fine-tuning; it can also extrapolate to context lengths beyond those covered by the limited fine-tuning data.
  • Dynamic-YaRN, which combines YaRN with Dynamic Scaling at inference time, allows more than a 2x context-window extension without any fine-tuning (see the sketch after this list).
  • YaRN allows efficient extrapolation when fine-tuned on shorter datasets and can take advantage of transfer learning for faster convergence.
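
As a rough illustration of the Dynamic Scaling idea in the second bullet, the sketch below recomputes the scale factor from the live sequence length each time the RoPE cache is rebuilt, reusing yarn_mscale from the sketch above. For brevity it substitutes plain linear position interpolation for YaRN's full NTK-by-parts frequency ramp, so treat it as an assumption-laden stand-in rather than the paper's exact method.

```python
import torch

def dynamic_rope_cos_sin(dim: int, current_len: int, original_max_len: int,
                         base: float = 10000.0):
    # Dynamic Scaling: derive s from the sequence length actually observed,
    # so prompts within the original window keep s = 1 (no quality loss).
    s = max(1.0, current_len / original_max_len)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    # Simplification: linear position interpolation (divide frequencies
    # by s) stands in for YaRN's NTK-by-parts per-frequency ramp.
    inv_freq = inv_freq / s
    pos = torch.arange(current_len, dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    m = yarn_mscale(s)  # sqrt(1/t) attention scaling from the sketch above
    return emb.cos() * m, emb.sin() * m

# A 4096-token model reading a 10k-token prompt without fine-tuning:
cos, sin = dynamic_rope_cos_sin(dim=128, current_len=10_000, original_max_len=4096)
```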
