Scaling Law

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2024 May 24 4:21
Editor
Edited
Edited
2026 Jun 16 10:55

Neural Scaling law with
Cross Entropy

The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. These relationships allow us to determine the optimal allocation of a fixed compute budget.
Larger models has significantly more
Sample efficiency
, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

Kaplan-style power law

N - parameters, D - dataset size, C – compute
notion image
  • Larger models are more sample efficient
  • Transfer improves with test performance
notion image
notion image
Single layer transformer doesn’t have  ability
Single layer transformer doesn’t have
In-context learning
ability
It is feasible when increasing N and D simultaneously but it has capacity when one the them is fixed.
notion image
Larger model more easily achieves higher performance compared to smaller model which means has better
Sample efficiency
. Consequently, it is good to scale model size than data size when the computing cost is constant.

Dataset

notion image

Latent size
Chinchilla Scaling

 
 

OpenAI

Scaling Laws for Neural Language Models
arxiv.org

Deepmind

UNIFIED SCALING LAWS FOR ROUTED LANGUAGE MODELS
arxiv.org
Training Compute-Optimal Large Language Models (
Chinchilla Scaling
)
arxiv.org

Wikipedia

Neural scaling law
In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size,[1][2] and training cost.

llm only?

A 10× scale-up yields about a 32% performance improvement for LLMs, whereas robotics and biology domains see only around ~15%.
Using examples like Wayve's autonomous-driving data generation engine and Basecamp Research's advanced biological data supply chain, this highlights the importance of a "data engine".
The Fickleness of Scaling Laws
Or, the Foundations of Foundation Models Nothing has defined tech in the last five years more than scaling laws. LLMs demonstrated consistent returns to scaling to dumbfounding sizes and cracked the positive feedback loop of scale to larger size to make a more useful product to raise and make more
The Fickleness of Scaling Laws
 
 

Recommendations