Dataset Filtering

Creator: Seonglae Cho
Created: 2024 Mar 31 10:13
Edited: 2024 Jun 16 3:15
Refs
  • Quality Filtering
  • Deduplication
 
https://www.youtube.com/watch?v=2-SPH9hIKT8
 
 
 
 
 

Similarity-based filtering

  • Fuzzy deduplication at the document level (see the sketch below)
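A minimal, self-contained sketch of document-level fuzzy deduplication: documents whose word-shingle sets overlap above a Jaccard-similarity threshold are treated as near-duplicates and only the first is kept. This is an illustration only; production pipelines typically approximate the same idea with MinHash + LSH to avoid the O(n²) pairwise comparisons used here.

```python
# Toy document-level fuzzy deduplication via shingle Jaccard similarity.
# Assumption: a 0.8 similarity threshold and word 5-gram shingles are
# illustrative defaults, not values prescribed by any specific pipeline.

def shingles(text: str, n: int = 5) -> set:
    """Return the set of word n-grams (shingles) for a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def fuzzy_dedup(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if it is not a near-duplicate of an earlier one."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog near the river bank",
        "the quick brown fox jumps over the lazy dog near the river bank today",
        "completely different text about training large language models",
    ]
    print(len(fuzzy_dedup(corpus)))  # 2: the first two documents are near-duplicates
```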

Dataset augmentation

  • To increase data diversity (toy sketch below)
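The note does not specify an augmentation method, so the sketch below is purely illustrative: each source document yields a few perturbed variants via random word dropout and adjacent-word swaps. Heavier approaches (back-translation, LLM paraphrasing) follow the same one-document-in, several-variants-out pattern.

```python
import random

# Toy text augmentation: simple surface perturbations to add diversity.
# Assumption: dropout probability, swap count, and variant count are
# arbitrary illustrative defaults.

def word_dropout(text: str, p: float = 0.1, seed: int = 0) -> str:
    """Randomly drop a fraction p of the words."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else text

def adjacent_swap(text: str, n_swaps: int = 2, seed: int = 0) -> str:
    """Swap n_swaps pairs of adjacent words."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def augment(doc: str, n_variants: int = 2) -> list:
    """Return the original document plus n_variants perturbed copies."""
    variants = [doc]
    for k in range(n_variants):
        variants.append(adjacent_swap(word_dropout(doc, seed=k), seed=k))
    return variants

if __name__ == "__main__":
    for v in augment("dataset augmentation increases the diversity of training data"):
        print(v)
```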
 
 
 
A little guide to building Large Language Models in 2024 — an introductory talk (with references for further reading) on everything needed to train a well-performing LLM in 2024. Part 1 covers training data: building a web-scale pretraining corpus, language and quality filtering, deduplication, final data preparation, evaluating data quality at scale, and the datatrove and lighteval libraries. Part 2 covers modeling: data, tensor, pipeline, and sequence parallelism, GPU synchronisation, Flash Attention v1/v2, stable training recipes, Mixture-of-Experts, Mamba, and the nanotron library. Part 3 covers fine-tuning and alignment: RLHF with PPO, DPO, and REINFORCE. Part 4 covers fast inference: quantization, speculative decoding, and compilation. Slides: https://docs.google.com/presentation/d/1IkzESdOwdmwvPxIELYJi8--K3EZ98_cL6c5ZcLKSyVg/mobilepresent?slide=id.p
 
 
 

Recommendations