Texonom / Engineering /Data Engineering/Artificial Intelligence/AI Data/Dataset/NLP Dataset/Synthetic Dataset/Synthetic Data Generation/

Pretraining Synthetic Data Generation

Creator: Seonglae Cho
Created: 2025 Mar 6 16:44
Editor: Seonglae Cho
Edited: 2025 Nov 7 11:37
Refs
Model Collapse
Pretraining
Synthetic data should form only a portion of the pretraining corpus; mixing it with human data mitigates Model Collapse.

How to avoid collapse: ToEdit (token-level editing)
arxiv.org
https://arxiv.org/pdf/2412.14689
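The ToEdit idea (token-level editing instead of full generation) can be sketched as follows. This is a minimal illustration, not the paper's implementation: a language model scores each token of human text, and positions where the model is overconfident (probability above a threshold) are resampled from the model's distribution, while informative low-probability tokens are kept. The toy unigram "model", vocabulary, and threshold value are assumptions for illustration.

```python
import random

# Toy "language model": unigram probabilities over a tiny vocabulary.
# In ToEdit these scores would come from a pretrained causal LM.
VOCAB_PROBS = {
    "the": 0.30, "cat": 0.05, "sat": 0.04, "on": 0.20,
    "a": 0.25, "mat": 0.03, "dog": 0.05, "ran": 0.08,
}

def token_level_edit(tokens, threshold=0.15, seed=0):
    """Resample tokens the model is overconfident about (p > threshold).

    Low-probability (informative) tokens from the human text are kept,
    so the output stays anchored to real data (semi-synthetic), which is
    the mechanism that avoids full-synthetic distribution drift.
    """
    rng = random.Random(seed)
    vocab, weights = zip(*VOCAB_PROBS.items())
    edited = []
    for tok in tokens:
        p = VOCAB_PROBS.get(tok, 0.0)
        if p > threshold:
            # Overconfident position: resample from the model distribution.
            edited.append(rng.choices(vocab, weights=weights, k=1)[0])
        else:
            # Keep the original human token.
            edited.append(tok)
    return edited

print(token_level_edit(["the", "cat", "sat", "on", "a", "mat"]))
```

Here "cat", "sat", and "mat" survive unchanged because the model assigns them low probability, while frequent function words get resampled.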
Phi
Cosmopedia for SmolAgents: small LMs useful for On-device AI
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
https://huggingface.co/blog/cosmopedia

Unconditional Generation

FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders
Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features.
https://www.arxiv.org/abs/2506.17673
https://arxiv.org/pdf/2510.18554
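Unconditional generation here means sampling from the model with no prompt (starting from the BOS token only), so every token is drawn from the model's own distribution; FaithfulSAE uses such model-generated data to train SAEs on features the model itself produces. A minimal sketch with a toy Markov chain standing in for the LM; the transition table is an assumption for illustration.

```python
import random

# Toy next-token transition table standing in for an LM's distribution.
TRANSITIONS = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.5},
    "a":     {"cat": 0.5, "dog": 0.5},
    "cat":   {"sat": 1.0},
    "dog":   {"ran": 1.0},
    "sat":   {"<eos>": 1.0},
    "ran":   {"<eos>": 1.0},
}

def unconditional_sample(max_len=10, seed=0):
    """Sample a sequence starting from <bos> with no prompt.

    Because there is no conditioning text, the output reflects only the
    model's own distribution, i.e., "faithful" model-internal data.
    """
    rng = random.Random(seed)
    tok, out = "<bos>", []
    for _ in range(max_len):
        next_tokens, probs = zip(*TRANSITIONS[tok].items())
        tok = rng.choices(next_tokens, weights=probs, k=1)[0]
        if tok == "<eos>":
            break
        out.append(tok)
    return out

print(unconditional_sample())
```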
Copyright Seonglae Cho