Neural Scaling law with Cross Entropy
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. These relationships allow us to determine the optimal allocation of a fixed compute budget.
Larger models has significantly more Sample efficiency, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
Kaplan-style power law
N - parameters, D - dataset size, C – compute

- Larger models are more sample efficient
- Transfer improves with test performance



It is feasible when increasing N and D simultaneously but it has capacity when one the them is fixed.

Larger model more easily achieves higher performance compared to smaller model which means has better Sample efficiency. Consequently, it is good to scale model size than data size when the computing cost is constant.
Dataset

Latent size Chinchilla Scaling
OpenAI
Scaling Laws for Neural Language Models
arxiv.org
https://arxiv.org/pdf/2001.08361
Deepmind
UNIFIED SCALING LAWS FOR ROUTED LANGUAGE MODELS
arxiv.org
https://arxiv.org/pdf/2202.01169
Training Compute-Optimal Large Language Models (Chinchilla Scaling )
arxiv.org
https://arxiv.org/pdf/2203.15556
Wikipedia
Neural scaling law
In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size,[1][2] and training cost.
https://en.wikipedia.org/wiki/Neural_scaling_law
llm only?
A 10× scale-up yields about a 32% performance improvement for LLMs, whereas robotics and biology domains see only around ~15%.
Using examples like Wayve's autonomous-driving data generation engine and Basecamp Research's advanced biological data supply chain, this highlights the importance of a "data engine".
The Fickleness of Scaling Laws
Or, the Foundations of Foundation Models Nothing has defined tech in the last five years more than scaling laws. LLMs demonstrated consistent returns to scaling to dumbfounding sizes and cracked the positive feedback loop of scale to larger size to make a more useful product to raise and make more
https://www.mackenziemorehead.com/the-fickleness-of-scaling-laws/


Seonglae Cho