Self Attention is the core feature
The Transformer is the first architecture that actually scales. Before the Transformer, RNNs such as LSTMs did not scale cleanly even when stacked.
The Transformer takes a wider view of the input and can attend to multiple levels of interaction within the input sentence. Unlike CNNs and RNNs, a significant advance is its improved handling of distant, long-term dependencies. The Transformer is not just proficient at language modeling; it is a versatile token-sequence model with broad applications across domains.
The model enables parallel processing by computing attention for all tokens simultaneously, and unlike previous attention mechanisms, all of the token vectors take part in computing the attention weights (as queries, keys, and values).
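To make this all-pairs, parallel computation concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. The function name self_attention, the toy shapes, and the random projection matrices are illustrative assumptions, not details from the paper.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the whole sequence at once.

    x:             (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices (random here)
    """
    q = x @ w_q                        # one query per token
    k = x @ w_k                        # one key per token
    v = x @ w_v                        # one value per token
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)    # (seq_len, seq_len): every token scores every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                 # each output is a weighted sum of all values

# toy usage: 5 tokens, d_model=8, d_k=4 (hypothetical sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # shape (5, 4)
```

Because the score matrix covers every token pair in one matrix product, distant positions interact in a single step and the whole sequence is processed in parallel.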
Since this paper, major changes in the field include moving Layer Normalization to the pre-norm position, replacing it with RMS Normalization, and using GLU variants as the FFN activation.
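As a rough sketch of those later changes, the PyTorch modules below show RMS normalization, a GLU-style feed-forward block (SwiGLU is assumed here as the concrete variant), and a pre-norm residual wrapper. The class names and dimensions are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: scale-only, no mean subtraction or bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLUFFN(nn.Module):
    """Feed-forward block with a gated (GLU-style) activation instead of ReLU."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

def pre_norm_ffn(x, norm, ffn):
    """Pre-norm residual: normalize before the sublayer, then add the residual."""
    return x + ffn(norm(x))

# toy usage with hypothetical sizes
x = torch.randn(2, 5, 64)                       # (batch, seq_len, d_model)
y = pre_norm_ffn(x, RMSNorm(64), SwiGLUFFN(64, 256))
```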
Transformer Model Notion
Transformer Models
Transformer Visualization
Complete 3D visualization
Matrix form details
Blockwise flow