Positional Embedding

Creator: Alan Jo
Created: 2023 Mar 7 13:59
Editor: Alan Jo
Edited: 2024 Apr 19 14:10

Positional consideration like CNN

Positional embedding is the process of creating a vector that supplies the positional information of each word, using embeddings that indicate where the word appears in the sequence. The Transformer needs this separate positional signal because it is not sensitive to order; it only models the relevance between elements of the input sequence. In other words, Self-Attention does not encode order by itself, so an embedding that encodes position is added to the token embedding, which is what makes the Transformer effective on ordered input.
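As a rough sketch (the sizes and names below are illustrative assumptions, not taken from any particular model), the positional vector is simply summed with the token embedding before the input reaches the attention layers:

```python
import torch
import torch.nn as nn

# Minimal sketch; the sizes below are illustrative assumptions.
vocab_size, max_len, d_model = 1000, 512, 64

tok_emb = nn.Embedding(vocab_size, d_model)   # token embedding
pos_emb = nn.Embedding(max_len, d_model)      # positional embedding

ids = torch.tensor([[5, 42, 7, 99]])                 # (batch, seq_len)
positions = torch.arange(ids.size(1)).unsqueeze(0)   # [[0, 1, 2, 3]]

# Position information is simply added to the token embedding
# before the input enters the self-attention layers.
x = tok_emb(ids) + pos_emb(positions)                # (batch, seq_len, d_model)
```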
In terms of terminology, Positional Encoding creates the position vectors directly with a deterministic function, whereas Positional Embedding creates them with a trainable embedding layer. In other words, as the model trains, Positional Encoding is not updated, but Positional Embedding is.
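To make the terminology concrete, here is a minimal sketch assuming PyTorch: a deterministic sinusoidal Positional Encoding kept as a fixed tensor, next to a learned Positional Embedding implemented as a trainable nn.Embedding table.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Deterministic Positional Encoding (sinusoidal formulation)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe                            # fixed values, never updated by training

# Positional Encoding: a fixed tensor with no gradients.
pe = sinusoidal_encoding(512, 64)

# Positional Embedding: a trainable table, updated by backpropagation.
learned = nn.Embedding(512, 64)

print(pe.requires_grad, learned.weight.requires_grad)   # False True
```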
Positional Encoding can always produce a position vector, even when the input sentence is very long. Positional Embedding, however, cannot produce vectors for positions beyond the size of its embedding table. (That case could be handled separately, and in practice the length limit of a Transformer also comes from memory and compute constraints.) Moreover, if the model never saw sequences beyond a certain length during training, it struggles when processing such long sequences, and perplexity increases.
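Continuing the sketch above (still an illustrative assumption, not a reference implementation), the difference in length handling shows up directly: the learned table cannot index positions beyond its size, while the sinusoidal function can simply be recomputed for a longer sequence.

```python
import torch
import torch.nn as nn

# The learned table only covers the positions it was built (and trained) for.
learned = nn.Embedding(512, 64)          # positions 0..511 only

try:
    learned(torch.arange(513))           # position 512 is outside the table
except IndexError as err:
    print("learned positional embedding fails beyond its table size:", err)

# sinusoidal_encoding(2048, 64) from the sketch above, by contrast, is valid
# for any length -- though quality still degrades on lengths never seen in training.
```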
At their most basic level, positional embeddings are like token addresses. Attention heads can use them to prefer tokens at certain relative positions, but they can also do much more: Transformers can perform "pointer arithmetic"-style operations on positional embeddings. In one induction-head circuit, for example, an earlier head copies the position embedding of a previous occurrence of the current token, and the induction head then uses Q-composition to rotate that position embedding one token forward, thereby attending to the token that follows it.
Positional Encodings
 
 
 
 

Then, why not separate positional information by concatenation?

The orthogonality that concatenation would guarantee can also be approximately achieved through the high dimensionality of the embedding, which is why simple addition suffices; see the sketch below.
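A small numerical check of this claim (a sketch added here, not part of the original note): random vectors in a high-dimensional space are nearly orthogonal on average, so summing a positional vector onto a token embedding interferes with the token information about as little as concatenating a separate positional block would.

```python
import torch
import torch.nn.functional as F

# Sketch: random vectors in high dimensions are nearly orthogonal on average,
# so the summed token and position signals remain largely separable,
# much like they would be under explicit concatenation.
torch.manual_seed(0)
for d in (8, 64, 512):
    tok = torch.randn(1000, d)
    pos = torch.randn(1000, d)
    cos = F.cosine_similarity(tok, pos, dim=-1)
    print(f"d={d:4d}  mean |cos| = {cos.abs().mean().item():.3f}")
# The mean |cosine| shrinks toward 0 as d grows, i.e. near-orthogonality.
```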
 
 
 

Recommendations