- Tokenizer Training (Rust BPE)
  - Train a BPE tokenizer on FineWeb-EDU data (tokenizer sketch after this list)
  - Configure vocab size + special tokens
- Base Pretraining
  - Train a general language model on large-scale text (FineWeb-EDU)
  - Next-token prediction objective (loss sketch below)
- Mid-training (Intermediate Adaptation Stage)
  - Use mixed data from SmolTalk (conversation), MMLU, and GSM8K (mixture sketch below)
  - Expand the model's internal reasoning and world knowledge, without loss masking or chat special tokens
- SFT (Supervised Fine-Tuning, Chat Format)
  - Use user ↔ assistant conversation-format data
  - Mask user messages; backpropagate on assistant tokens only (masking sketch below)
- Optional RL (GRPO / REINFORCE)
  - Update the model based on rewards for GSM8K problems
  - Generate multiple answer samples → apply a policy gradient (GRPO sketch below)
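Each stage above can be sketched in a few lines. For the tokenizer, a minimal sketch assuming the Hugging Face `tokenizers` library (a Rust-backed BPE implementation); the corpus path, vocab size, and special-token names are placeholders rather than the pipeline's actual configuration.

```python
# Minimal BPE tokenizer training sketch using the Rust-backed
# Hugging Face `tokenizers` library. Path, vocab size, and
# special-token names are illustrative placeholders.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=65536,  # placeholder vocab size
    special_tokens=["<|bos|>", "<|user|>", "<|assistant|>", "<|end|>"],  # placeholders
)

# Train from plain-text shards of the pretraining corpus (e.g. FineWeb-EDU dumps).
tokenizer.train(files=["fineweb_edu_shard_000.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```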
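Base pretraining uses plain next-token prediction: shift the targets by one position and minimize cross-entropy. A minimal PyTorch sketch, assuming a `model` that maps token ids of shape (B, T) to logits of shape (B, T, V):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Next-token prediction: predict tokens[:, 1:] from tokens[:, :-1].

    tokens: (B, T) LongTensor of token ids.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (B, T-1, V)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*(T-1), V)
        targets.reshape(-1),                  # (B*(T-1),)
    )
```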
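Mid-training keeps the same objective and mainly changes the data mixture. A sketch of weighted source sampling; the mixture weights here are illustrative assumptions, not values from the pipeline above.

```python
import random

# Illustrative mixture weights over the mid-training sources; the real
# proportions are a tuning choice, not taken from the pipeline above.
MIXTURE = {
    "smoltalk": 0.6,  # conversations
    "mmlu":     0.2,  # multiple-choice knowledge
    "gsm8k":    0.2,  # grade-school math
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source according to the mixture weights."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```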
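For SFT, the key mechanic is the loss mask: user tokens are excluded from the loss so gradients flow only through assistant tokens. A sketch using PyTorch's `ignore_index` convention, assuming a per-token boolean `assistant_mask` is available from the chat template:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # cross_entropy skips targets with this value

def sft_loss(model, tokens, assistant_mask):
    """tokens: (B, T) token ids; assistant_mask: (B, T) bool, True on assistant tokens."""
    inputs = tokens[:, :-1]
    targets = tokens[:, 1:].clone()
    # Only back-propagate on positions whose *target* token is an assistant token.
    targets[~assistant_mask[:, 1:]] = IGNORE_INDEX
    logits = model(inputs)  # (B, T-1, V)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```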
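For the optional RL stage, a simplified GRPO-style REINFORCE sketch: sample a group of answers per GSM8K problem, score each (e.g. exact match on the final answer), use the group-normalized reward as the advantage, and take a policy-gradient step. The reward values below are toy numbers, and a real setup would add details such as KL regularization and token-level credit assignment.

```python
import torch

def grpo_loss(logprobs, rewards):
    """Group-relative policy gradient for one problem.

    logprobs: (G,) sum of token log-probs of each sampled answer,
              computed with gradients under the current policy.
    rewards:  (G,) scalar reward per sample, e.g. 1.0 if the final
              GSM8K answer is correct else 0.0.
    """
    rewards = rewards.float()
    # Group-relative advantage: normalize rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # REINFORCE: increase log-prob of samples with positive advantage.
    return -(advantages.detach() * logprobs).mean()

# Toy usage with fake numbers: 4 samples, 2 of them correct.
logprobs = torch.randn(4, requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = grpo_loss(logprobs, rewards)
loss.backward()
```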
LLMs should be viewed not as a single model but as a model family controlled by a single dial, compute, which lets us verify scaling laws and build confidence in "large-scale training". Using depth as the dial, train d10–d20 models under the same FLOPs budget; if the loss curves don't cross, each model is compute-optimal.
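One way to operationalize the "curves don't cross" check: evaluate each depth's loss on a shared FLOPs grid and test whether any pair of curves swaps order. A sketch with synthetic toy curves, used purely to illustrate the check:

```python
import numpy as np

def curves_cross(losses):
    """losses: dict depth -> np.array of losses on a shared FLOPs grid.

    Returns True if any two models swap order anywhere on the grid,
    i.e. the family is not cleanly ordered across the budget.
    """
    depths = sorted(losses)
    for i, a in enumerate(depths):
        for b in depths[i + 1:]:
            diff = losses[a] - losses[b]
            if np.any(diff > 0) and np.any(diff < 0):
                return True
    return False

# Toy curves on a shared FLOPs grid (illustrative numbers only).
flops = np.logspace(18, 20, 50)
toy = {d: 3.5 - 0.1 * d + 8.0 / np.log10(flops) for d in (10, 14, 20)}
print(curves_cross(toy))  # False: these toy curves never swap order
```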

Seonglae Cho