Attention with Linear Biases (ALiBi)
Traditional Transformer positional embeddings (especially sinusoidal) extrapolate poorly, i.e. they generalize badly to inputs longer than those seen during training. ALiBi uses no positional embeddings at all; instead it adds a linear penalty to each attention score based on the query–key distance, a bias of the form −slope × distance with a fixed, head-specific slope. Because it relies only on relative distance, it is robust to length extrapolation: a model trained on short sequences maintains (or improves) performance on longer inputs, and no additional parameters are required. Experiments in the original paper, including WikiText-103 language modeling and a 1.3B-parameter model, show lower perplexity on longer contexts than sinusoidal baselines along with reduced training cost.
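A minimal PyTorch sketch of the idea, assuming causal self-attention with power-of-two head counts; the function names and shapes are illustrative, not taken from the paper's code:

```python
import math
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the (n_heads, seq_len, seq_len) ALiBi bias added to attention scores.

    Each head h gets a slope m_h from a geometric sequence (for 8 heads:
    1/2, 1/4, ..., 1/256); the bias for query i attending to key j <= i
    is -m_h * (i - j).
    """
    # Head-specific slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # Relative distance i - j; positions j > i are masked out later anyway.
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)
    return -slopes[:, None, None] * distance  # broadcast to (n_heads, L, L)

def attention_with_alibi(q, k, v):
    """Scaled dot-product attention with ALiBi biases and a causal mask.

    q, k, v: (n_heads, seq_len, head_dim)
    """
    n_heads, seq_len, head_dim = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    scores = scores + alibi_bias(n_heads, seq_len)  # linear distance penalty, no learned params
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Since the slopes are fixed rather than learned, the bias can be precomputed once per sequence length and reused across layers.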
Seonglae Cho