Cross-Entropy Loss
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. These relationships allow us to determine the optimal allocation of a fixed compute budget.
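As a rough sketch of these power laws, the snippet below evaluates the separate fits for loss versus N, D, and C. The functional forms and constants are approximate values reported in the OpenAI paper and are only meant to illustrate the shape of the trends, not to reproduce the fits exactly.

```python
# Separate power-law fits from "Scaling Laws for Neural Language Models".
# The constants are approximate values quoted in the paper; treat them as illustrative.

def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Cross-entropy loss as a function of (non-embedding) parameter count N."""
    return (n_c / n_params) ** alpha_n

def loss_vs_data(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """Cross-entropy loss as a function of dataset size D in tokens."""
    return (d_c / n_tokens) ** alpha_d

def loss_vs_compute(pf_days, c_c=3.1e8, alpha_c=0.050):
    """Cross-entropy loss as a function of (minimum) training compute C in PF-days."""
    return (c_c / pf_days) ** alpha_c

if __name__ == "__main__":
    # Each 10x increase in N multiplies the loss by roughly 10 ** -0.076 ~ 0.84.
    for n in (1e8, 1e9, 1e10):
        print(f"N={n:.0e}  L(N)={loss_vs_params(n):.3f}")
```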
Larger models are significantly more sample-efficient, so optimally compute-efficient training means training very large models on a relatively modest amount of data and stopping significantly before convergence.
N - model parameters, D - dataset size (tokens), C - training compute
- Larger models are more sample efficient
- Transfer improves with test performance: evaluating on a distribution different from the training one incurs a roughly constant offset in loss, but otherwise transfer performance improves in step with performance on the training distribution
Performance improves predictably as long as N and D are scaled up together, but runs into diminishing returns (a capacity or data bottleneck) when one of them is held fixed while the other grows.
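One way to see this bottleneck is the joint fit L(N, D) reported in the OpenAI paper: with D held fixed, growing N only pushes the fitted loss toward a data-limited floor. A minimal sketch, reusing the same approximate constants as above:

```python
# Joint loss fit L(N, D) from the OpenAI paper, with the same approximate constants
# as above. Holding D fixed while N grows shows the loss flattening toward the
# data-limited floor (D_C / D) ** ALPHA_D.

N_C, ALPHA_N = 8.8e13, 0.076
D_C, ALPHA_D = 5.4e13, 0.095

def joint_loss(n_params, n_tokens):
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

if __name__ == "__main__":
    n_tokens = 1e10  # dataset fixed at 10B tokens
    for n_params in (1e8, 1e9, 1e10, 1e11):
        print(f"N={n_params:.0e}  L={joint_loss(n_params, n_tokens):.3f}")
    print(f"data-limited floor: {(D_C / n_tokens) ** ALPHA_D:.3f}")
```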
A larger model reaches a given level of performance with fewer training samples than a smaller one, i.e. it is more sample-efficient. Consequently, for a fixed compute budget it is better to put more of the budget into model size than into dataset size.
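A simplified illustration of the allocation question: fix a compute budget C ≈ 6·N·D training FLOPs (a standard approximation), sweep how the budget is split between parameters and tokens, and evaluate the joint fit above at each split. This is only a sketch of the trade-off, not the derivation used in the papers; the Chinchilla paper listed below revisits the same question with updated fits and arrives at a more balanced split between N and D.

```python
# Sketch: for a fixed budget C ~= 6 * N * D FLOPs, sweep the split between
# parameters N and tokens D and evaluate the joint fit at each point.
# Illustrative only; constants are the same approximate values as above.

N_C, ALPHA_N = 8.8e13, 0.076
D_C, ALPHA_D = 5.4e13, 0.095

def joint_loss(n_params, n_tokens):
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

def best_split(compute_flops, n_grid=400):
    """Return (N, D, fitted loss) minimizing the joint fit subject to C = 6 * N * D."""
    best = None
    for i in range(n_grid):
        n_params = 10 ** (6 + 8 * i / (n_grid - 1))  # log-spaced, 1e6 .. 1e14 params
        n_tokens = compute_flops / (6 * n_params)
        cand = (n_params, n_tokens, joint_loss(n_params, n_tokens))
        if best is None or cand[2] < best[2]:
            best = cand
    return best

if __name__ == "__main__":
    for c in (1e19, 1e21, 1e23):
        n, d, loss_val = best_split(c)
        print(f"C={c:.0e} FLOPs -> N~{n:.1e} params, D~{d:.1e} tokens, L~{loss_val:.3f}")
```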
Papers
- OpenAI: Scaling Laws for Neural Language Models
- DeepMind: Unified Scaling Laws for Routed Language Models
- DeepMind: Training Compute-Optimal Large Language Models (Chinchilla scaling)