Update weights while reading tokens
Essentially, the starting point of backpropagation is a cross-entropy loss that measures the difference between the probability distribution produced by the softmax over the logits and the actual labels; the gradient then flows back through the whole network down to the Positional Embedding and Token Embedding.
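A minimal sketch of this step (PyTorch is assumed here, and the transformer blocks between the embeddings and the output head are omitted for brevity):

```python
import torch
import torch.nn.functional as F

# Toy setup: 5-token vocabulary, 8-dim embeddings, 2 sequences of length 3.
vocab_size, d_model = 5, 8
token_emb = torch.nn.Embedding(vocab_size, d_model)
pos_emb = torch.nn.Embedding(16, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.tensor([[1, 2, 3], [3, 4, 0]])   # input token ids
labels = torch.tensor([[2, 3, 4], [4, 0, 1]])   # next-token targets
positions = torch.arange(tokens.size(1))

# Forward pass (the transformer blocks in between are omitted).
hidden = token_emb(tokens) + pos_emb(positions)
logits = lm_head(hidden)                         # unnormalized scores

# cross_entropy applies log-softmax to the logits, then compares with labels.
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))

# Backpropagation: gradients reach every parameter, including both embeddings.
loss.backward()
print(token_emb.weight.grad.shape, pos_emb.weight.grad.shape)
```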
Techniques
- SGD/Adam is not as well behaved for Transformers as for other neural networks
- Transformers require a lot of data, larger batch sizes than usual, and lower learning rates
- Data Shuffling is important
- At the start of training, vanilla Transformers are unstable because of unbalanced gradients, so we use Learning rate Warmup
- The unstable gradients are understood to come from the combination of Residual Connection & Layer Normalization, and can be amplified by Adam
- We need an adaptive learning rate $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$, where $step$ is the training step and $warmup\_steps$ is the number of warmup steps (see the sketch after this list)
- Warmup effectively caters the learning rate to the largest gradients that appear early in training
- Regularization through dropout and label smoothing to prevent overfitting
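A minimal sketch of this warmup schedule (the function name and the example d_model/warmup values are just illustrative defaults taken from the original Transformer paper):

```python
import math

def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate schedule from the original Transformer paper:
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
    It rises linearly during warmup, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate occurs exactly at the end of warmup.
for s in (1, 1000, 4000, 10000, 100000):
    print(s, f"{noam_lr(s):.2e}")
```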
Residual connections require normalization to prevent the output variance from exploding, even with modern weight initialization schemes (Xavier, He). Although Xavier Initialization aims to keep the variance constant across layers, a residual block sums two channels, the identity and the transformed branch, as in $x_{l+1} = x_l + F(x_l)$, so their variances add and keep growing; as the network gets deeper, this accumulation can cause exploding-gradient problems.
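A toy numerical check of this variance argument (the width, depth, and use of PyTorch here are arbitrary choices for illustration):

```python
import torch

torch.manual_seed(0)
d, depth = 512, 64
x = torch.randn(1024, d)  # unit-variance input activations

def residual_stack(x, use_layernorm):
    ln = torch.nn.LayerNorm(d)
    for _ in range(depth):
        w = torch.empty(d, d)
        torch.nn.init.xavier_normal_(w)  # Xavier keeps Var(F(x)) close to Var(x)
        x = x + x @ w                    # identity + branch: variances add (~2x per layer)
        if use_layernorm:
            x = ln(x)                    # renormalizes, stopping the growth
    return x.var().item()

print("without LayerNorm:", residual_stack(x.clone(), False))  # blows up with depth
print("with LayerNorm:   ", residual_stack(x.clone(), True))   # stays near 1
```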
Even so, Transformer training remains unstable because Xavier Initialization and the optimizers in use (SGD, Adam) are not perfectly stable on their own. Zhang et al. (2019) showed that ResNet-style models can be trained stably without normalization given proper initialization and settings. This opens the door to further research on stabilizing Transformer training through customized initialization and optimization strategies.
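A hedged sketch of that idea, not the exact recipe of any particular paper: scale the residual-branch weights down with depth and zero the final projection so every block starts close to the identity function.

```python
import torch

def init_residual_branch(linear_layers, num_layers: int):
    """Depth-aware initialization in the spirit of Fixup / T-Fixup-style schemes:
    shrink branch weights with depth and zero the last projection,
    so each residual block initially contributes almost nothing."""
    scale = num_layers ** -0.5  # one common choice; the exact exponent varies by paper
    for layer in linear_layers[:-1]:
        torch.nn.init.xavier_normal_(layer.weight)
        layer.weight.data.mul_(scale)
        torch.nn.init.zeros_(layer.bias)
    torch.nn.init.zeros_(linear_layers[-1].weight)  # block output starts at 0
    torch.nn.init.zeros_(linear_layers[-1].bias)

# Example: a 2-layer feed-forward branch inside a hypothetical 24-layer network.
ffn = [torch.nn.Linear(512, 2048), torch.nn.Linear(2048, 512)]
init_residual_branch(ffn, num_layers=24)
```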
Transformer Training Steps
Technical tips
Training Tips for the Transformer Model
This article describes our experiments in neural machine translation using the recent Tensor2Tensor framework and the Transformer sequence-to-sequence model (Vaswani et al., 2017). We examine some...
https://arxiv.org/abs/1804.00247

A little guide to building Large Language Models in 2024
A little guide through all you need to know to train a good performance large language model in 2024.
This is an introduction talk with link to references for further reading.
This is the first video of a 2 part series:
- Video 1 (this video): covering all the concepts to train a good performance LLM in 2024
- Video 2 (next video): hands-on applying all these concepts with code example
This video is adapted from a talk I gave in 2024 at an AI/ML winter school for graduate students. When I shared the slides online, people kept asking for a recording of the unrecorded class, so I decided to spend a morning recording it to share it more widely along with the slides.
Link to the slides: https://docs.google.com/presentation/d/1IkzESdOwdmwvPxIELYJi8--K3EZ98_cL6c5ZcLKSyVg/mobilepresent?slide=id.p
Chapters:
00:00:00 Intro
00:00:59 Workflow for LLMs
Part 1: Training: data
00:01:17 Data preparation - intro and good recent resources on data preparation
00:05:28 A web scale pretraining corpus - goals and challenges
00:11:29 Web scale data sources – Focus on recent datasets
00:18:01 Language, and quality filtering
00:24:34 Diving in data deduplication
00:27:40 Final data preparation for training
00:31:31 How to evaluate data quality at scale
00:36:29 The datatrove and lighteval libraries
Part 2: Training: modeling
00:38:18 Introduction to modeling techniques for LLM training
00:39:09 When the model is too big: parallelism
00:40:00 Data parallelism
00:41:18 Tensor parallelism
00:44:38 Pipeline parallelism
00:47:00 Sequence parallelism and references on 4D parallelism
00:47:52 Synchronisation: GPU-CPU and GPU-GPU challenges
00:52:14 Flash attention v1 and v2
00:56:23 Stable training recipes
00:59:12 New architectures: Mixture-of-experts
01:03:13 New architectures: Mamba
01:04:49 The nanotron library
Part 3: Fine-tuning: RLHF and alignment
01:06:15 RLHF in 2024
01:08:23 PPO, DPO and REINFORCE
Part 4: Fast inference techniques
01:11:23 Quantization, speculative decoding and compilation: overview and resources
End
01:14:36 Sharing your model, datasets and demo – final words
https://www.youtube.com/watch?v=2-SPH9hIKT8

Philipp Schmid on LinkedIn: How are open LLMs trained and created in 2024? 🤔 01.AI just released…
How are open LLMs trained and created in 2024? 🤔 01.AI just released their paper on how they created the YI, a family of LLMs and V-LLMs. The paper includes…
https://www.linkedin.com/posts/philipp-schmid-a6a2bb196_how-are-open-llms-trained-and-created-in-activity-7172185170664505345-oTV9/
MiniCPM: Unveiling the Potential of End-side Large Language Models | Notion
Authors: Shengding Hu, Yuge Tu, Xu Han*, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Kaihuo Zhang, Yuxiang Huang, Zhenning Dai, Baitao Gong, Chongyi Wang, Yuan Yao, Jie Zhou, Jie Cai, Xinrong Zhang, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu*, Maosong Sun
https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20

Improving Transformer Optimization Through Better Initialization
The Transformer architecture has achieved considerable success recently; the key component of the Transformer is the attention layer that enables the model t...
https://proceedings.mlr.press/v119/huang20f.html

Seonglae Cho