Scaling Law

Creator: Seonglae Cho
Created: 2024 May 24 4:21
Edited: 2024 Sep 12 16:55

Cross Entropy

The loss scales as a power law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effect within a wide range. These relationships allow us to determine the optimal allocation of a fixed compute budget.
Larger models have significantly better Sample efficiency, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
N: number of parameters, D: dataset size, C: compute
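For reference, the single-variable power laws reported in Scaling Laws for Neural Language Models (linked below) take roughly this form, with the paper's approximate fitted exponents:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076$$

$$L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095$$

$$L(C_{\min}) \approx \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C^{\min}}, \qquad \alpha_C^{\min} \approx 0.050$$

where $N_c$, $D_c$, $C_c$ are fitted constants and $C_{\min}$ is the minimal compute needed to reach a given loss.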
  • Larger models are more sample efficient
  • Transfer to text from other distributions improves in line with test performance, at a roughly constant offset in loss
A single-layer transformer doesn't have In-context learning ability.
Loss improves predictably as long as N and D are scaled up together, but runs into diminishing returns when one of them is held fixed while the other increases.
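The same paper captures this trade-off with a combined fit of roughly the form

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$

where whichever of $N$ or $D$ is held fixed eventually dominates the bracket and bounds the achievable loss.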
A larger model reaches a given level of performance more easily than a smaller one, which means it has better Sample efficiency. Consequently, when the compute budget is fixed, it is better to scale model size than dataset size.
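As a rough illustration (not from this note), here is a minimal sketch of that allocation, assuming Kaplan-style fits where the optimal model size grows roughly as C^0.73 and the number of training tokens as C^0.27; the reference point (n_ref, d_ref, c_ref) is a made-up placeholder, not a value from the paper:

```python
# Minimal sketch: split a compute budget C between model size N and tokens D,
# assuming Kaplan-style exponents (N_opt ~ C^0.73, D_opt ~ C^0.27).
# n_ref / d_ref / c_ref form a hypothetical reference point, not paper values.

def allocate_budget(c, c_ref=1.0, n_ref=1.0e9, d_ref=2.0e10,
                    n_exp=0.73, d_exp=0.27):
    """Scale a reference-optimal (N, D) pair to a new compute budget c."""
    ratio = c / c_ref
    return n_ref * ratio ** n_exp, d_ref * ratio ** d_exp

for budget in (1.0, 10.0, 100.0):  # compute in arbitrary units
    n, d = allocate_budget(budget)
    print(f"C={budget:>6.1f}  N ~ {n:.2e} params  D ~ {d:.2e} tokens")
```

With these exponents, a 100x larger budget buys roughly a 28x larger model but only about 3.5x more data, which is the sense in which model size should grow faster than dataset size. The Chinchilla paper linked below later revised this, finding that N and D should be scaled roughly in equal proportion (about C^0.5 each).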

Dataset

 
 
 

OpenAI

Scaling Laws for Neural Language Models

DeepMind

Unified Scaling Laws for Routed Language Models
Training Compute-Optimal Large Language Models (Chinchilla Scaling)

Wikipedia

 
 
