Use synthetic data for only a portion of pretraining
How to avoid collapse: ToEdit (token-level editing)
https://arxiv.org/pdf/2412.14689
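A minimal sketch of the token-level editing idea: where the model is already very confident about a human-written token, resample that token from the model; keep everything else. The result is semi-synthetic data that stays anchored to the human distribution instead of drifting toward pure model output. The threshold value, `token_prob`, and `sample_fn` below are toy placeholders, not the paper's implementation.

```python
def toedit(tokens, token_prob, sample_fn, p=0.99):
    """Token-level editing sketch: resample tokens the model
    assigns probability > p to; keep the rest verbatim."""
    edited = []
    for i, tok in enumerate(tokens):
        if token_prob(tokens[:i], tok) > p:      # model overly confident here
            edited.append(sample_fn(tokens[:i])) # resample from the model
        else:
            edited.append(tok)                   # keep the human token
    return edited

# toy demo: a "model" that is overconfident only on the token "the"
demo_prob = lambda prefix, tok: 1.0 if tok == "the" else 0.5
demo_sample = lambda prefix: "a"
edited = toedit(["the", "cat", "sat"], demo_prob, demo_sample)
# only "the" gets resampled; "cat" and "sat" survive unchanged
```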
Cosmopedia was used to train SmolLM, a small LLM useful for on-device AI
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
https://huggingface.co/blog/cosmopedia
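The Cosmopedia recipe, roughly: take seed topics (extracted from web and curated sources in the real pipeline), cross them with target audiences and writing styles to maximize prompt diversity, then have a strong LLM write one synthetic page per prompt (Cosmopedia used Mixtral-8x7B). A minimal sketch, where `generate` is a placeholder for the actual model call:

```python
from itertools import product

SEED_TOPICS = ["sparse autoencoders", "model collapse"]
AUDIENCES = ["young children", "college students"]
STYLES = ["textbook chapter", "blog post"]

def build_prompts(topics, audiences, styles):
    # cross seeds with audience/style to diversify the synthetic corpus
    return [
        f"Write a {style} about {topic} for {audience}."
        for topic, audience, style in product(topics, audiences, styles)
    ]

def generate(prompt):
    # placeholder for a real LLM call
    return f"[synthetic text for: {prompt}]"

prompts = build_prompts(SEED_TOPICS, AUDIENCES, STYLES)
corpus = [generate(p) for p in prompts]  # 2 x 2 x 2 = 8 documents
```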
Unconditional Generation
FaithfulSAE: Towards Capturing Faithful Features with Sparse...
Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted...
https://www.arxiv.org/abs/2506.17673
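The FaithfulSAE idea, as I read it: instead of training the SAE on an external corpus (which can inject out-of-distribution "fake" features), build the training set from the model's own unconditional generations, i.e. sampling from BOS with no prompt, and train the SAE on activations from that self-generated data. A toy sketch with a placeholder model; the SAE itself is elided:

```python
import random

def sample_unconditionally(model_step, bos="<bos>", max_len=8, seed=0):
    """Generate a sequence starting from BOS only, with no external
    prompt or dataset conditioning."""
    rng = random.Random(seed)
    seq = [bos]
    for _ in range(max_len):
        seq.append(model_step(seq, rng))
    return seq

def toy_model_step(seq, rng):
    # stand-in for one decoding step of the real model
    return rng.choice(["a", "b", "c"])

# the "faithful" dataset: the model's own unconditional samples;
# SAE training would then use activations collected on these sequences
faithful_dataset = [
    sample_unconditionally(toy_model_step, seed=s) for s in range(4)
]
```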

https://arxiv.org/pdf/2510.18554

Seonglae Cho