AI Scaling Prediction

Creator
Seonglae Cho
Created
2025 Dec 18 16:17
Edited
2025 Dec 18 16:19

As models grow, the shape of the learning dynamics stays the same; only the speed and scale change. When LLMs of various sizes (parameter counts) are trained and certain conditions are met, their entire training loss curves "collapse" onto a single curve, and the paper proposes using this for training efficiency, debugging, and early stopping in hyperparameter optimization (HPO). In other words, different model sizes describe essentially the same training process once the axes are rescaled, as sketched below.
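A minimal sketch of what this rescaling looks like in practice, assuming the x-axis is transformed to normalized training progress (step / total steps) and the y-axis to loss divided by its final value; the paper's exact transform may differ, and the curves below are synthetic stand-ins rather than real training logs.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for per-step training losses of three model sizes;
# in real use these would be loaded from training logs instead.
def synthetic_loss(num_steps: int, scale: float) -> np.ndarray:
    steps = np.arange(1, num_steps + 1)
    return 2.0 + scale * steps ** -0.3  # rough power-law-shaped decay

runs = {
    "125M": synthetic_loss(2_000, 3.0),
    "350M": synthetic_loss(6_000, 2.5),
    "1.3B": synthetic_loss(20_000, 2.0),
}

for name, loss in runs.items():
    steps = np.arange(1, len(loss) + 1)
    x = steps / len(loss)   # normalized training progress (fraction of the run)
    y = loss / loss[-1]     # loss normalized by its final value
    plt.plot(x, y, label=name)

plt.xlabel("fraction of training")
plt.ylabel("loss / final loss")
plt.legend()
plt.title("Rescaled loss curves: collapse check")
plt.show()
```

If the conditions listed next hold, the rescaled curves from different model sizes should lie nearly on top of each other.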
The collapse occurs when: 1) the TPP (tokens-per-parameter = D/N) is fixed, 2) the AdamW timescale τ is fixed (the "EMA memory length" set jointly by the learning rate, weight decay, and batch size), and 3) the LR schedule is fixed.
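For context on condition 2, one common way to write the AdamW timescale: decoupled weight decay makes the weights an exponential moving average of recent updates with a memory length of roughly $1/(\eta\lambda)$ optimizer steps, and dividing by the run length expresses it as a fraction of training, which is where the batch size enters. This is an assumed form for illustration; the paper's exact definition may differ.

```latex
% Assumed form of the AdamW "EMA memory length" (illustrative; may differ from the paper):
%   eta = learning rate, lambda = weight decay, B = batch size in tokens, D = total training tokens
\tau_{\text{steps}} \approx \frac{1}{\eta\,\lambda}
\qquad\Longrightarrow\qquad
\tau \;\approx\; \frac{\tau_{\text{steps}}}{D/B} \;=\; \frac{B}{\eta\,\lambda\,D}
```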
Since the end of a run can be predicted from its beginning, this enables early stopping and cheaper HPO, and it gives a basis for transferring optimizer settings found on small models to large ones. When residuals deviate from the collapse curve, training problems such as numerical issues, kernel bugs, or the need for a restart can be detected early, well before the anomaly becomes obvious in the raw loss graph. Using the collapse curve as a "reference trajectory," the paper claims that in large-model HPO, training only 10-30% of the run is enough to predict the final performance (final loss) and stop early, as in the sketch below.
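A minimal sketch of the "reference trajectory" idea, assuming a collapsed curve fitted from small-model runs (normalized so its final value is 1.0) and a simple ratio-based extrapolation; the function name, the interpolation, and the 2% early-stop margin are hypothetical choices, not the paper's method.

```python
import numpy as np

def predict_final_loss(partial_loss: np.ndarray,
                       observed_fraction: float,
                       reference_curve: np.ndarray) -> float:
    """Extrapolate the final loss of a partially trained run.

    partial_loss:      per-step losses observed so far (e.g. the first 10-30% of the run)
    observed_fraction: fraction of the full run covered by partial_loss (0 < f <= 1)
    reference_curve:   collapsed curve from small-model runs, sampled on a uniform
                       grid over training fraction [0, 1] and normalized so that
                       its final value is 1.0
    """
    # Reference value at the currently observed training fraction.
    grid = np.linspace(0.0, 1.0, len(reference_curve))
    ref_now = np.interp(observed_fraction, grid, reference_curve)

    # Under collapse, loss(t) / loss(T) ~= reference(t / T), so the final loss
    # is approximately the latest observed loss divided by the reference value.
    return float(partial_loss[-1] / ref_now)

# Hypothetical HPO usage: stop a trial early once its predicted final loss is
# clearly worse than the best completed trial.
# predicted = predict_final_loss(trial_losses, observed_fraction=0.2, reference_curve=ref)
# if predicted > 1.02 * best_final_loss:
#     stop_trial()
```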
 
 
 
