AI Scaling Prediction

Creator
Seonglae Cho
Created
2025 Dec 18 16:17
Edited
2025 Dec 18 16:19

As models grow, the shape of the learning dynamics stays the same; only the speed and scale change. When LLMs of various sizes (parameter counts) are trained and certain conditions are met, their entire training loss curves "collapse" onto a single curve, and the paper proposes using this for training efficiency, debugging, and early stopping in hyperparameter optimization (HPO). In other words, different model sizes describe essentially the same training process once the axes are rescaled, as sketched below.
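A minimal sketch of what this rescaling looks like in practice, assuming the x-axis is transformed to normalized training progress (step / total steps) and the y-axis to loss divided by its final value; the paper's exact transform may differ, and the curves below are synthetic stand-ins rather than real training logs.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for per-step training losses of three model sizes;
# in real use these would be loaded from training logs instead.
def synthetic_loss(num_steps: int, scale: float) -> np.ndarray:
    steps = np.arange(1, num_steps + 1)
    return 2.0 + scale * steps ** -0.3  # rough power-law-shaped decay

runs = {
    "125M": synthetic_loss(2_000, 3.0),
    "350M": synthetic_loss(6_000, 2.5),
    "1.3B": synthetic_loss(20_000, 2.0),
}

for name, loss in runs.items():
    steps = np.arange(1, len(loss) + 1)
    x = steps / len(loss)   # normalized training progress (fraction of the run)
    y = loss / loss[-1]     # loss normalized by its final value
    plt.plot(x, y, label=name)

plt.xlabel("fraction of training")
plt.ylabel("loss / final loss")
plt.legend()
plt.title("Rescaled loss curves: collapse check")
plt.show()
```

If the conditions listed next hold, the rescaled curves from different model sizes should lie nearly on top of each other.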
The collapse occurs when: 1) the TPP (tokens-per-parameter = D/N) is fixed, 2) the AdamW timescale τ is fixed (the "EMA memory length" set jointly by the learning rate, weight decay, and batch size), and 3) the LR schedule is fixed.
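For context on condition 2, one common way to write the AdamW timescale: decoupled weight decay makes the weights an exponential moving average of recent updates with a memory length of roughly $1/(\eta\lambda)$ optimizer steps, and dividing by the run length expresses it as a fraction of training, which is where the batch size enters. This is an assumed form for illustration; the paper's exact definition may differ.

```latex
% Assumed form of the AdamW "EMA memory length" (illustrative; may differ from the paper):
%   eta = learning rate, lambda = weight decay, B = batch size in tokens, D = total training tokens
\tau_{\text{steps}} \approx \frac{1}{\eta\,\lambda}
\qquad\Longrightarrow\qquad
\tau \;\approx\; \frac{\tau_{\text{steps}}}{D/B} \;=\; \frac{B}{\eta\,\lambda\,D}
```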
Since the end of a run can be predicted from its beginning, this enables early stopping and cheaper HPO, and it gives a basis for transferring optimizer settings found on small models to large ones. When residuals deviate from the collapse curve, training problems such as numerical issues, kernel bugs, or the need for a restart can be detected early, well before the anomaly becomes obvious in the raw loss graph. Using the collapse curve as a "reference trajectory," the paper claims that in large-model HPO, training only 10-30% of the run is enough to predict the final performance (final loss) and stop early, as in the sketch below.
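A minimal sketch of the "reference trajectory" idea, assuming a collapsed curve fitted from small-model runs (normalized so its final value is 1.0) and a simple ratio-based extrapolation; the function name, the interpolation, and the 2% early-stop margin are hypothetical choices, not the paper's method.

```python
import numpy as np

def predict_final_loss(partial_loss: np.ndarray,
                       observed_fraction: float,
                       reference_curve: np.ndarray) -> float:
    """Extrapolate the final loss of a partially trained run.

    partial_loss:      per-step losses observed so far (e.g. the first 10-30% of the run)
    observed_fraction: fraction of the full run covered by partial_loss (0 < f <= 1)
    reference_curve:   collapsed curve from small-model runs, sampled on a uniform
                       grid over training fraction [0, 1] and normalized so that
                       its final value is 1.0
    """
    # Reference value at the currently observed training fraction.
    grid = np.linspace(0.0, 1.0, len(reference_curve))
    ref_now = np.interp(observed_fraction, grid, reference_curve)

    # Under collapse, loss(t) / loss(T) ~= reference(t / T), so the final loss
    # is approximately the latest observed loss divided by the reference value.
    return float(partial_loss[-1] / ref_now)

# Hypothetical HPO usage: stop a trial early once its predicted final loss is
# clearly worse than the best completed trial.
# predicted = predict_final_loss(trial_losses, observed_fraction=0.2, reference_curve=ref)
# if predicted > 1.02 * best_final_loss:
#     stop_trial()
```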
 
 
 
