Texonom / Engineering /Data Engineering/Artificial Intelligence/AI Data/Dataset/NLP Dataset/Synthetic Dataset/Synthetic Data Generation/

Pretraining Synthetic Data Generation

Creator: Seonglae Cho
Created: 2025 Mar 6 16:44
Editor: Seonglae Cho
Edited: 2025 Nov 7 11:37
Refs
Model Collapse
Pretraining
Synthetic data should form only a portion of the pretraining corpus; mixing it with human data mitigates Model Collapse.

How to avoid collapse: ToEdit (token-level editing)
arxiv.org
https://arxiv.org/pdf/2412.14689
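The ToEdit idea (token-level editing instead of full generation) can be sketched as follows. This is a minimal illustration, not the paper's implementation: a language model scores each token of human text, and positions where the model is overconfident (probability above a threshold) are resampled from the model's distribution, while informative low-probability tokens are kept. The toy unigram "model", vocabulary, and threshold value are assumptions for illustration.

```python
import random

# Toy "language model": unigram probabilities over a tiny vocabulary.
# In ToEdit these scores would come from a pretrained causal LM.
VOCAB_PROBS = {
    "the": 0.30, "cat": 0.05, "sat": 0.04, "on": 0.20,
    "a": 0.25, "mat": 0.03, "dog": 0.05, "ran": 0.08,
}

def token_level_edit(tokens, threshold=0.15, seed=0):
    """Resample tokens the model is overconfident about (p > threshold).

    Low-probability (informative) tokens from the human text are kept,
    so the output stays anchored to real data (semi-synthetic), which is
    the mechanism that avoids full-synthetic distribution drift.
    """
    rng = random.Random(seed)
    vocab, weights = zip(*VOCAB_PROBS.items())
    edited = []
    for tok in tokens:
        p = VOCAB_PROBS.get(tok, 0.0)
        if p > threshold:
            # Overconfident position: resample from the model distribution.
            edited.append(rng.choices(vocab, weights=weights, k=1)[0])
        else:
            # Keep the original human token.
            edited.append(tok)
    return edited

print(token_level_edit(["the", "cat", "sat", "on", "a", "mat"]))
```

Here "cat", "sat", and "mat" survive unchanged because the model assigns them low probability, while frequent function words get resampled.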
Phi
Cosmopedia for SmolAgents: small LMs useful for On-device AI
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
https://huggingface.co/blog/cosmopedia

Unconditional Generation

FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders
Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features.
https://www.arxiv.org/abs/2506.17673
https://arxiv.org/pdf/2510.18554
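Unconditional generation here means sampling from the model with no prompt (starting from the BOS token only), so every token is drawn from the model's own distribution; FaithfulSAE uses such model-generated data to train SAEs on features the model itself produces. A minimal sketch with a toy Markov chain standing in for the LM; the transition table is an assumption for illustration.

```python
import random

# Toy next-token transition table standing in for an LM's distribution.
TRANSITIONS = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.5},
    "a":     {"cat": 0.5, "dog": 0.5},
    "cat":   {"sat": 1.0},
    "dog":   {"ran": 1.0},
    "sat":   {"<eos>": 1.0},
    "ran":   {"<eos>": 1.0},
}

def unconditional_sample(max_len=10, seed=0):
    """Sample a sequence starting from <bos> with no prompt.

    Because there is no conditioning text, the output reflects only the
    model's own distribution, i.e., "faithful" model-internal data.
    """
    rng = random.Random(seed)
    tok, out = "<bos>", []
    for _ in range(max_len):
        next_tokens, probs = zip(*TRANSITIONS[tok].items())
        tok = rng.choices(next_tokens, weights=probs, k=1)[0]
        if tok == "<eos>":
            break
        out.append(tok)
    return out

print(unconditional_sample())
```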
Copyright Seonglae Cho