ALiBi

Creator
Seonglae Cho
Created
2026 Jan 2 11:36
Edited
2026 Jan 2 11:44

Attention with Linear Biases

Traditional Transformer positional embeddings (especially sinusoidal) extrapolate poorly to inputs longer than those seen during training (weak length generalization). ALiBi does not use positional embeddings at all; instead it directly adds a linear penalty to each attention score based on the query–key distance, i.e. a bias of the form −slope × distance with a fixed, head-specific slope. Because it relies only on relative distance, it is robust under length extrapolation: performance stays stable or even improves on long inputs despite training on shorter sequences, and no additional learned parameters are required. Experiments in the paper (WikiText-103 and a 1.3B-parameter model) report lower perplexity on longer evaluation contexts along with reduced training cost.
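
In other words, for query position i and key position j the score becomes softmax(q_i·k_j − m·(i−j)), where m is fixed per head (the paper uses a geometric sequence of slopes, e.g. 1/2, 1/4, …, 1/256 for 8 heads). A minimal PyTorch sketch of this idea, assuming single-batch tensors and illustrative function names (not the paper's actual code):

```python
import math

import torch
import torch.nn.functional as F


def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric head-specific slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads
    # (matches the paper's scheme when num_heads is a power of two).
    ratio = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([ratio ** (h + 1) for h in range(num_heads)])


def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # bias[h, i, j] = -slope_h * (i - j): zero on the diagonal,
    # penalty growing linearly with query-key distance.
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)
    return -alibi_slopes(num_heads)[:, None, None] * distance


def attention_with_alibi(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (num_heads, seq_len, head_dim); note that no positional
    # embedding is ever added to q or k -- the bias carries all position info.
    num_heads, seq_len, head_dim = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    scores = scores + alibi_bias(seq_len, num_heads)
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Because the bias depends only on distance and not on absolute position, the same bias formula applies unchanged when seq_len at inference exceeds the training length, which is what enables the train-short, test-long behavior.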