Transformer Training

Creator: Seonglae Cho
Created: 2024 Feb 21 9:48
Edited: 2024 Nov 25 15:42

Update weights by reading tokens

Backpropagation starts from a cross-entropy loss that measures the difference between the softmax distribution over the logits and the actual labels; the gradients then flow all the way back through the network to the Positional Embedding and Token Embedding.
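A minimal PyTorch sketch of this (toy sizes; the Transformer blocks are omitted, so the names and shapes here are illustrative assumptions, not from the note): cross-entropy applies a log-softmax to the logits, and its gradients reach the token and positional embedding tables.

```python
import torch
import torch.nn.functional as F

# Toy setup: gradients from the cross-entropy loss flow back into
# the token and positional embedding tables.
vocab_size, seq_len, d_model = 100, 8, 32
tok_emb = torch.nn.Embedding(vocab_size, d_model)
pos_emb = torch.nn.Embedding(seq_len, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))    # input token ids
targets = torch.randint(0, vocab_size, (1, seq_len))   # next-token labels
positions = torch.arange(seq_len).unsqueeze(0)

# Forward pass (the Transformer blocks in between are omitted for brevity)
hidden = tok_emb(tokens) + pos_emb(positions)
logits = lm_head(hidden)                               # (1, seq_len, vocab)

# Cross-entropy applies log-softmax to the logits and compares with labels
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()

# Backpropagation reaches the embedding tables
print(tok_emb.weight.grad.abs().sum(), pos_emb.weight.grad.abs().sum())
```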

Techniques

  • SGD/Adam is not as well behaved on Transformers as on other neural networks
  • Requires a lot of data, larger batch sizes than usual, and lower learning rates
  • At the start of training, traditional Transformers are unstable because of unbalanced gradients, so we use Learning rate Warmup
    • The unstable gradients are understood to come from Residual Connection & Layer Normalization, and are potentially amplified by Adam
    • We need an adaptive learning rate $lrate = d_{model}^{-0.5}\cdot\min(step\_num^{-0.5},\; step\_num\cdot warmup\_steps^{-1.5})$, where $step\_num$ is the algorithmic step and $warmup\_steps$ is the number of warmup steps (see the sketch after this list)
      • catering the learning rate to the largest gradient with warmup
  • Regularization through dropout and label smoothing to prevent overfitting
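A small sketch of the warmup schedule above, assuming the inverse-square-root formula from the original Transformer paper; the d_model and warmup_steps defaults are illustrative.

```python
def transformer_lr(step_num: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Inverse-square-root schedule: linear warmup, then decay as step^-0.5."""
    step_num = max(step_num, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# During warmup the rate grows linearly, then it decays after warmup_steps
print(transformer_lr(100), transformer_lr(4000), transformer_lr(40000))
```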
Residual connections require normalization to prevent output variance explosion even when used with SOTA Weight Initialization (Xavier, He). Although Xavier Initialization aims to keep the variance constant, a residual block sums two channels, the identity path and the transformation, as in $x + F(x)$, so the variance keeps growing and exploding-gradient problems can occur as the network gets deeper.
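A toy numerical sketch of this variance growth (sizes and depth are assumptions): stacking Xavier-initialized residual blocks $x \leftarrow x + F(x)$ without normalization makes the output scale blow up with depth.

```python
import torch

d_model, depth = 256, 64
x = torch.randn(1024, d_model)  # unit-variance input

# Stack residual blocks x <- x + F(x) with Xavier-initialized linear layers
for _ in range(depth):
    w = torch.empty(d_model, d_model)
    torch.nn.init.xavier_normal_(w)
    x = x + x @ w.T  # identity channel + transformation channel

# Each block adds variance, so the scale explodes with depth unless normalized
print(x.std())
```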
Despite this, Transformer training remains unstable because Xavier Initialization and the training algorithms (SGD and the Adam Optimizer) are not perfectly stable. Zhang et al., 2019 suggested that ResNet-style models can be trained stably without normalization through proper initialization and settings, which opens the door for further research into stabilizing Transformer training through customized initialization and optimization strategies.
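One concrete example of such a customized initialization is GPT-2-style residual scaling, shown below as an illustration (this is not Zhang et al.'s recipe; the layer count and base standard deviation are assumptions): residual-branch output projections are initialized with a smaller standard deviation so stacked residual adds keep roughly constant variance.

```python
import math
import torch

n_layers, d_model = 12, 256

def init_residual_proj(linear: torch.nn.Linear, n_layers: int) -> None:
    """Shrink residual-branch output projections so stacked residuals
    keep roughly unit variance at initialization (GPT-2-style scaling)."""
    std = 0.02 / math.sqrt(2 * n_layers)  # two residual adds per block (attn + MLP)
    torch.nn.init.normal_(linear.weight, mean=0.0, std=std)
    torch.nn.init.zeros_(linear.bias)

proj = torch.nn.Linear(d_model, d_model)
init_residual_proj(proj, n_layers)
print(proj.weight.std())
```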
Transformer Training Steps
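As a hedged illustration of the step sequence (the model, batch, optimizer, and scheduler are placeholders, and the hyperparameters are assumptions), one training step typically combines the pieces above: forward pass, label-smoothed cross-entropy, backward pass, gradient clipping, optimizer update, and a scheduler step for warmup/decay.

```python
import torch
import torch.nn.functional as F

def train_step(model, batch, optimizer, scheduler, max_grad_norm=1.0):
    """One training step: forward, loss, backward, clip, update, schedule."""
    tokens, targets = batch                          # placeholder tensors
    logits = model(tokens)                           # (batch, seq, vocab)
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        label_smoothing=0.1,                         # regularization from the list above
    )
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()                                 # e.g. warmup + decay schedule
    return loss.item()
```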
 
 
 

Technical tips

A little guide to building Large Language Models in 2024
A guide through all you need to know to train a good-performance large language model in 2024. This is an introduction talk with links to references for further reading, and the first video of a two-part series:
  • Video 1 (this video): covering all the concepts to train a good-performance LLM in 2024
  • Video 2 (next video): hands-on, applying all these concepts with code examples
The video is adapted from a talk given in 2024 at an AI/ML winter school for graduate students. When the slides were shared online, people kept asking for a recording of the unrecorded class, so the author recorded it to share more widely along with the slides. Link to the slides: https://docs.google.com/presentation/d/1IkzESdOwdmwvPxIELYJi8--K3EZ98_cL6c5ZcLKSyVg/mobilepresent?slide=id.p
Chapters:
  • 00:00:00 Intro
  • 00:00:59 Workflow for LLMs
Part 1: Training: data
  • 00:01:17 Data preparation - intro and good recent resources on data preparation
  • 00:05:28 A web-scale pretraining corpus - goals and challenges
  • 00:11:29 Web-scale data sources - focus on recent datasets
  • 00:18:01 Language and quality filtering
  • 00:24:34 Diving into data deduplication
  • 00:27:40 Final data preparation for training
  • 00:31:31 How to evaluate data quality at scale
  • 00:36:29 The datatrove and lighteval libraries
Part 2: Training: modeling
  • 00:38:18 Introduction to modeling techniques for LLM training
  • 00:39:09 When the model is too big: parallelism
  • 00:40:00 Data parallelism
  • 00:41:18 Tensor parallelism
  • 00:44:38 Pipeline parallelism
  • 00:47:00 Sequence parallelism and references on 4D parallelism
  • 00:47:52 Synchronisation: GPU-CPU and GPU-GPU challenges
  • 00:52:14 Flash attention v1 and v2
  • 00:56:23 Stable training recipes
  • 00:59:12 New architectures: Mixture-of-experts
  • 01:03:13 New architectures: Mamba
  • 01:04:49 The nanotron library
Part 3: Fine-tuning: RLHF and alignment
  • 01:06:15 RLHF in 2024
  • 01:08:23 PPO, DPO and REINFORCE
Part 4: Fast inference techniques
  • 01:11:23 Quantization, speculative decoding and compilation: overview and resources
End
  • 01:14:36 Sharing your model, datasets and demo - final words
Layer Normalization for transformers with Pre-Norm
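A minimal sketch of a Pre-Norm block (LayerNorm applied before each sub-layer, inside the residual branch); the module sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-Norm: normalize the input of each sub-layer, keep the residual path clean."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual add after attention
        x = x + self.mlp(self.norm2(x))                     # residual add after MLP
        return x

x = torch.randn(2, 16, 256)
print(PreNormBlock()(x).shape)  # torch.Size([2, 16, 256])
```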
 
 

 

Recommendations