Distributed ML

Creator: Seonglae Cho
Created: 2022 Mar 15 11:35
Editor: Seonglae Cho
Edited: 2026 Jan 6 17:22

Multi-GPU and multi-node distributed training, including Federated Learning.

Data parallelism or model parallelism

  • In data parallelism, the data is split into multiple parts and each processor trains a full replica of the model on its own shard (see the DDP sketch below)
  • In model parallelism, different parts of the model are placed on and processed by separate processors
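A minimal data-parallelism sketch using PyTorch DistributedDataParallel; the toy model, batch size, and launch command are illustrative assumptions rather than a prescribed setup.

```python
# Data parallelism: every rank holds a full model replica and trains on
# its own data shard; gradients are averaged across ranks by all-reduce.
# Launch (assumed): torchrun --nproc_per_node=4 ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # torchrun provides RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # toy model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)  # stand-in for this rank's shard
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()      # DDP all-reduces gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```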

These parallelism strategies are composed together, which is commonly described as 3D or 4D parallelism; a rank-decomposition sketch follows below.
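A minimal sketch of how a flat set of GPU ranks factors into a parallelism grid; the degrees (DP=2, PP=2, TP=2 over 8 ranks) are assumed for illustration, and frameworks such as Megatron-LM or DeepSpeed perform this mapping internally.

```python
# 3D parallelism: world size = data x pipeline x tensor parallel degrees.
DP, PP, TP = 2, 2, 2          # assumed degrees for illustration
WORLD_SIZE = DP * PP * TP     # 8 ranks total

for rank in range(WORLD_SIZE):
    tp = rank % TP             # fastest-varying axis: tensor-parallel peers sit adjacent
    pp = (rank // TP) % PP     # next axis: pipeline stage
    dp = rank // (TP * PP)     # slowest axis: data-parallel replica group
    print(f"rank {rank}: dp={dp} pp={pp} tp={tp}")

# A fourth axis (e.g. context or sequence parallelism) extends this to 4D.
```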

Parallel Training Notion
  • Data Parallelism
  • Tensor Parallelism (column-split sketch after this list)
  • Pipeline Parallelism
  • Sequence Parallelism
  • Model Parallelism
  • AI Model Memory
  • Expert Parallelism
  • Context Parallelism
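To make tensor parallelism concrete, a CPU-only sketch of one linear layer split column-wise across two hypothetical workers; the shapes and two-way split are assumptions for illustration.

```python
# Tensor parallelism: shard one weight matrix across workers, then
# gather the partial outputs; the result matches the unsharded layer.
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)             # activations (batch of 4)
W = torch.randn(8, 16)            # full weight matrix

W0, W1 = W.chunk(2, dim=1)        # column split: each worker holds an 8x8 shard
y0 = x @ W0                       # worker 0's half of the output features
y1 = x @ W1                       # worker 1's half
y = torch.cat([y0, y1], dim=1)    # all-gather along the feature dimension

assert torch.allclose(y, x @ W)   # identical to the unsharded computation
```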
Distributed ML Tools
  • Ray AI
  • DeepSpeed (usage sketch below)

https://xiandong79.github.io
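A hedged sketch of DeepSpeed's engine API with ZeRO stage 2 data parallelism; the toy model, config values, and launch command are placeholders, not a recommended configuration.

```python
# DeepSpeed wraps the model in an engine that handles distributed setup,
# ZeRO sharding of optimizer states/gradients, and the optimizer step.
# Launch (assumed): deepspeed --num_gpus=4 ds_sketch.py
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # toy model
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(8, 1024, device=engine.device)
loss = engine(x).pow(2).mean()
engine.backward(loss)  # engine handles gradient reduction per the ZeRO config
engine.step()
```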

Large Scale

  • The Ultra-Scale Playbook (Hugging Face / nanotron): https://huggingface.co/spaces/nanotron/ultrascale-playbook
  • Model Parallelism (Hugging Face Transformers docs): https://huggingface.co/docs/transformers/v4.15.0/parallelism
  • Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/pdf/2205.05198.pdf
  • How to train a model on 10k H100 GPUs (soumith.ch): https://soumith.ch/blog/2024-10-02-training-10k-scale.md.html
Copyright Seonglae Cho