Update weights with reading tokens
Basically, the starting point of backpropagation is a cross-entropy loss that measures the difference between the probability distribution obtained by applying the softmax function to the logits and the actual labels; the gradients then flow backward through the entire network, all the way down to the Positional Embedding and Token Embedding.
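A minimal sketch of this starting point in PyTorch, assuming a toy vocabulary, learned positional embeddings, and a stand-in for the Transformer body (all sizes are hypothetical): the cross-entropy loss is computed on the logits (the softmax is applied inside the loss), and backward() carries gradients all the way to the Token Embedding and Positional Embedding weights.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 16, 8

tok_emb = nn.Embedding(vocab_size, d_model)   # Token Embedding
pos_emb = nn.Embedding(seq_len, d_model)      # learned Positional Embedding
lm_head = nn.Linear(d_model, vocab_size)      # projects hidden states to logits

tokens = torch.randint(0, vocab_size, (batch, seq_len))
labels = torch.randint(0, vocab_size, (batch, seq_len))
positions = torch.arange(seq_len).unsqueeze(0).expand(batch, -1)

# Stand-in for the Transformer body: embeddings summed, then projected to logits
hidden = tok_emb(tokens) + pos_emb(positions)
logits = lm_head(hidden)                      # (batch, seq_len, vocab_size)

# CrossEntropyLoss applies log-softmax internally and compares with the labels
loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), labels.view(-1))
loss.backward()

# Gradients have reached both embedding tables
print(tok_emb.weight.grad.shape, pos_emb.weight.grad.shape)
```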
Techniques
- SGD/Adam do not behave as well on Transformers as they do on other NNs
- Requires a lot of data, larger batch sizes than usual, and lower learning rates
- Data Shuffling is important
- At the start of training, traditional Transformers are unstable because of unbalanced gradients, so we use Learning Rate Warmup
- The unstable gradients are understood to be due to the Residual Connections & Layer Normalization, and are potentially amplified by Adam
- We need an adaptive learning rate, lrate = d_model^{-0.5} * min(step_num^{-0.5}, step_num * warmup_steps^{-1.5}), where step_num is the current training step and warmup_steps is the number of warmup steps (see the sketch after this list)
- With warmup, the learning rate is catered to the early phase where the gradients are largest: it stays small until warmup_steps is reached
- Regularization through dropout and label smoothing to prevent overfitting
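A minimal sketch of this warmup schedule in PyTorch, assuming the commonly used d_model = 512 and warmup_steps = 4000 and a placeholder model (all hypothetical choices): the rate grows linearly for warmup_steps and then decays with the inverse square root of the step number.

```python
import torch

def noam_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5})
    step = max(step, 1)                       # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Linear(512, 512)             # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(1, 10001):
    # loss.backward() would precede this in real training
    optimizer.step()                          # effective lr = 1.0 * noam_lr(step)
    scheduler.step()
```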
Residual connections require normalization to prevent the output variance from exploding, even when combined with SOTA Weight Initialization (Xavier, He). Although Xavier Initialization aims to keep the variance constant across layers, a residual block has two channels, the identity path and the residual branch, x_{l+1} = x_l + F(x_l), so the variance keeps accumulating: Var(x_{l+1}) ≈ Var(x_l) + Var(F(x_l)). As the network gets deeper, this growth can cause exploding gradient problems.
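A small illustrative sketch of this effect, assuming a toy stack of Xavier-initialized residual blocks (depth, width, and batch size are hypothetical): without LayerNorm the output variance grows by orders of magnitude with depth, while with LayerNorm it stays close to 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, depth = 512, 32

def residual_stack_variance(use_layernorm):
    with torch.no_grad():
        x = torch.randn(1024, d_model)            # unit-variance input
        for _ in range(depth):
            lin = nn.Linear(d_model, d_model)
            nn.init.xavier_uniform_(lin.weight)   # Xavier aims to preserve variance through the layer
            nn.init.zeros_(lin.bias)
            x = x + torch.relu(lin(x))            # x_{l+1} = x_l + F(x_l): variances add up
            if use_layernorm:
                x = nn.functional.layer_norm(x, (d_model,))
        return x.var().item()

print("without LayerNorm:", residual_stack_variance(False))  # grows with depth
print("with    LayerNorm:", residual_stack_variance(True))   # stays near 1
```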
Despite this, Transformer training remains unstable because Xavier Initialization and the training algorithms (SGD, the Adam optimizer) are not perfectly stable. Zhang et al., 2019 showed that ResNet-style models can be trained stably without normalization through proper initialization and training settings. This opens the door for further research into stabilizing Transformer training through customized initialization and optimization strategies.
Transformer Training Steps
Technical tips
Weight Initialization for Transformers with Learning Rate Warmup
Layer Normalization for Transformers with Pre-Norm
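A minimal sketch of the Pre-Norm arrangement, assuming PyTorch's nn.MultiheadAttention and hypothetical sizes: LayerNorm is applied before each sub-layer (attention and feed-forward) rather than after the residual addition, the ordering commonly reported to train more stably with little or no warmup.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Pre-Norm: x + SubLayer(LayerNorm(x)); the residual path stays un-normalized
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ff(self.norm2(x))
        return x

block = PreNormBlock()
out = block(torch.randn(2, 16, 512))   # (batch, seq_len, d_model)
print(out.shape)
```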