Transformer Training

Creator
Seonglae Cho
Created
2024 Feb 21 9:48
Edited
2024 Nov 25 15:42

Training updates the weights as the model reads tokens.

Back propagation starts from a cross-entropy loss that measures the difference between the actual labels and the probability distribution obtained by applying the softmax function to the logits. Gradients then flow back through the entire network, all the way down to the
Positional Embedding
and Token Embedding.
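As a minimal illustration of that starting point, here is the softmax cross-entropy loss for a single token in plain Python (the logit values are made up for the example):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target_index):
    # Negative log-probability assigned to the correct token
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Loss for one position where the correct next token has index 0
loss = cross_entropy([2.0, 1.0, 0.1], 0)
print(loss)  # ≈ 0.417
```

The gradient of this loss with respect to the logits is simply `softmax(logits) - one_hot(target)`, which is what makes cross-entropy with softmax such a convenient starting point for back propagation.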

Techniques

  • SGD/Adam does not behave as well on Transformers as on other NNs
  • Requires a lot of data, and larger batch sizes than usual with lower learning rates
  • At the start of training, vanilla transformers are unstable because of unbalanced gradients, so we use
    Learning rate Warmup
    • The unstable gradient is understood to be due to
      Residual Connection
      &
      Layer Normalization
      and potentially amplified by Adam
    • We need an adaptive learning rate $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$, where $step$ is the algorithmic step and $warmup\_steps$ is the number of warmup steps
      • catering the learning rate to the largest gradient with warmup
  • Regularization through dropout and label smoothing to prevent overfitting
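The warmup schedule above (the one from Vaswani et al., 2017, with their default $d_{model}=512$ and 4000 warmup steps) can be sketched as:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup up to warmup_steps, then inverse-square-root decay
    step = max(step, 1)  # avoid step ** -0.5 blowing up at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The two branches of min() cross exactly at step == warmup_steps,
# so the learning rate peaks there and decays afterwards
peak = transformer_lr(4000)
print(peak)  # ≈ 7e-4
```

Both branches equal $d_{model}^{-0.5} \cdot warmup\_steps^{-0.5}$ at the crossover, which is why the peak learning rate scales down as the model gets wider.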
Residual connections require normalization to prevent output variance explosion even when used with SOTA
Weight Initialization
(Xavier, He). Although
Xavier Initialization
aims to keep the variance constant, a residual block has two channels, the identity path and the sublayer path, $x_{l+1} = x_l + F(x_l)$, so the variance keeps accumulating and exploding-gradient problems can occur as the network gets deeper.
Even so, transformer training remains unstable because
Xavier Initialization
and the training algorithms (SGD variants such as the
Adam Optimizer
) are not perfectly stable. Zhang et al., 2019 suggested that ResNet-style models can be trained stably without normalization through proper initialization and settings. This opens the door for further research into stabilizing Transformer training through customized initialization and optimization strategies.
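A toy sketch of that variance accumulation, assuming an idealized variance-preserving sublayer (modeled here as independent unit-variance noise rather than a real attention or MLP block):

```python
import random
import statistics

random.seed(0)

def sublayer(x):
    # Stand-in for a variance-preserving sublayer F(x): unit-variance output
    return [random.gauss(0.0, 1.0) for _ in x]

def layer_norm(x):
    # Normalize to zero mean and unit variance, as LayerNorm does per vector
    mu = statistics.fmean(x)
    sd = statistics.pstdev(x)
    return [(v - mu) / sd for v in x]

x = [random.gauss(0.0, 1.0) for _ in range(4096)]
for _ in range(6):
    # The residual update x_{l+1} = x_l + F(x_l): each layer adds one
    # unit of variance, so variance grows roughly linearly with depth
    x = [a + b for a, b in zip(x, sublayer(x))]

print(statistics.pvariance(x))              # ≈ 7 after 6 residual layers
print(statistics.pvariance(layer_norm(x)))  # normalization resets it to 1.0
```

Stacking enough such layers without normalization lets the residual-stream variance grow without bound, which is the explosion the paragraph above describes.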
Transformer Training Steps
Technical tips

Training Tips for the Transformer Model
This article describes our experiments in neural machine translation using the recent Tensor2Tensor framework and the Transformer sequence-to-sequence model (Vaswani et al., 2017). We examine some...
A little guide to building Large Language Models in 2024
A little guide through all you need to know to train a good performance large language model in 2024. This is an introduction talk with links to references for further reading.
Philipp Schmid on LinkedIn: How are open LLMs trained and created in 2024? 🤔 01.AI just released… | 18 comments
How are open LLMs trained and created in 2024? 🤔 01.AI just released their paper on how they created the YI, a family of LLMs and V-LLMs. The paper includes… | 18 comments on LinkedIn
MiniCPM: Unveiling the Potential of End-side Large Language Models | Notion
Authors: Shengding Hu, Yuge Tu, Xu Han*, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Kaihuo Zhang, Yuxiang Huang, Zhenning Dai, Baitao Gong, Chongyi Wang, Yuan Yao, Jie Zhou, Jie Cai, Xinrong Zhang, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu*, Maosong Sun
Improving Transformer Optimization Through Better Initialization
The Transformer architecture has achieved considerable success recently; the key component of the Transformer is the attention layer that enables the model t...
Layer Normalization
for transformers with
Pre-Norm