Update weights with reading tokens
Basically, the starting point of backpropagation is a cross-entropy loss that measures the difference between the probability distribution obtained by applying the softmax function to the logits and the actual labels; the gradients then flow backward through the entire network, all the way down to the Positional Embedding and Token Embedding.
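A minimal sketch of this starting point in PyTorch, assuming a toy vocabulary, learned positional embeddings, and a stand-in for the Transformer body (all sizes are hypothetical): the cross-entropy loss is computed on the logits (the softmax is applied inside the loss), and backward() carries gradients all the way to the Token Embedding and Positional Embedding weights.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 16, 8

tok_emb = nn.Embedding(vocab_size, d_model)   # Token Embedding
pos_emb = nn.Embedding(seq_len, d_model)      # learned Positional Embedding
lm_head = nn.Linear(d_model, vocab_size)      # projects hidden states to logits

tokens = torch.randint(0, vocab_size, (batch, seq_len))
labels = torch.randint(0, vocab_size, (batch, seq_len))
positions = torch.arange(seq_len).unsqueeze(0).expand(batch, -1)

# Stand-in for the Transformer body: embeddings summed, then projected to logits
hidden = tok_emb(tokens) + pos_emb(positions)
logits = lm_head(hidden)                      # (batch, seq_len, vocab_size)

# CrossEntropyLoss applies log-softmax internally and compares with the labels
loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), labels.view(-1))
loss.backward()

# Gradients have reached both embedding tables
print(tok_emb.weight.grad.shape, pos_emb.weight.grad.shape)
```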
Techniques
- SGD/Adam do not behave as well on Transformers as they do on other NNs
- Requires a lot of data, larger batch sizes than usual, and lower learning rates
- Data Shuffling is important
- At the start of training, traditional Transformers are unstable because of unbalanced gradients, so we use Learning Rate Warmup
- The unstable gradients are understood to be due to the Residual Connections & Layer Normalization, and are potentially amplified by Adam
- We need an adaptive learning rate, lrate = d_model^{-0.5} * min(step_num^{-0.5}, step_num * warmup_steps^{-1.5}), where step_num is the current training step and warmup_steps is the number of warmup steps (see the sketch after this list)
- With warmup, the learning rate is catered to the early phase where the gradients are largest: it stays small until warmup_steps is reached
- Regularization through dropout and label smoothing to prevent overfitting
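A minimal sketch of this warmup schedule in PyTorch, assuming the commonly used d_model = 512 and warmup_steps = 4000 and a placeholder model (all hypothetical choices): the rate grows linearly for warmup_steps and then decays with the inverse square root of the step number.

```python
import torch

def noam_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5})
    step = max(step, 1)                       # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Linear(512, 512)             # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(1, 10001):
    # loss.backward() would precede this in real training
    optimizer.step()                          # effective lr = 1.0 * noam_lr(step)
    scheduler.step()
```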
Residual connections require normalization to prevent the output variance from exploding, even when combined with SOTA Weight Initialization (Xavier, He). Although Xavier Initialization aims to keep the variance constant across layers, a residual block has two channels, the identity path and the residual branch, x_{l+1} = x_l + F(x_l), so the variance keeps accumulating: Var(x_{l+1}) ≈ Var(x_l) + Var(F(x_l)). As the network gets deeper, this growth can cause exploding gradient problems.
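A small illustrative sketch of this effect, assuming a toy stack of Xavier-initialized residual blocks (depth, width, and batch size are hypothetical): without LayerNorm the output variance grows by orders of magnitude with depth, while with LayerNorm it stays close to 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, depth = 512, 32

def residual_stack_variance(use_layernorm):
    with torch.no_grad():
        x = torch.randn(1024, d_model)            # unit-variance input
        for _ in range(depth):
            lin = nn.Linear(d_model, d_model)
            nn.init.xavier_uniform_(lin.weight)   # Xavier aims to preserve variance through the layer
            nn.init.zeros_(lin.bias)
            x = x + torch.relu(lin(x))            # x_{l+1} = x_l + F(x_l): variances add up
            if use_layernorm:
                x = nn.functional.layer_norm(x, (d_model,))
        return x.var().item()

print("without LayerNorm:", residual_stack_variance(False))  # grows with depth
print("with    LayerNorm:", residual_stack_variance(True))   # stays near 1
```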
Despite this, Transformer training remains unstable because Xavier Initialization and the training algorithms (SGD, the Adam optimizer) are not perfectly stable. Zhang et al., 2019 showed that ResNet-style models can be trained stably without normalization through proper initialization and training settings. This opens the door for further research into stabilizing Transformer training through customized initialization and optimization strategies.
Transformer Training Steps
Technical tips
Weight Initialization for Transformers with Learning Rate Warmup
Layer Normalization for Transformers with Pre-Norm
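A minimal sketch of the Pre-Norm arrangement, assuming PyTorch's nn.MultiheadAttention and hypothetical sizes: LayerNorm is applied before each sub-layer (attention and feed-forward) rather than after the residual addition, the ordering commonly reported to train more stably with little or no warmup.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Pre-Norm: x + SubLayer(LayerNorm(x)); the residual path stays un-normalized
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ff(self.norm2(x))
        return x

block = PreNormBlock()
out = block(torch.randn(2, 16, 512))   # (batch, seq_len, d_model)
print(out.shape)
```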