20 tokens per parameter
Previous models were undertrained (around 5 tokens per parameter). Under the Chinchilla fit, the compute-optimal allocation scales roughly as N_opt ∝ C^0.5 and D_opt ∝ C^0.5, where C represents the compute budget, N the number of model parameters, and D the number of training tokens.
Optimal amount of data a model of a given size should be trained on
Larger models require more data
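
A minimal back-of-envelope sketch of that allocation, assuming the common approximations C ≈ 6·N·D for training FLOPs and the ~20 tokens-per-parameter rule of thumb; the function name and sample budgets are illustrative, not from the paper's code.

```python
# Back-of-envelope Chinchilla allocation: given a compute budget C (FLOPs),
# split it so that training tokens D ~= 20 * parameters N.
# Uses the standard approximation C ~= 6 * N * D.

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params N, tokens D) under C = 6*N*D and D = tokens_per_param * N."""
    # Substituting D = k*N into C = 6*N*D gives N = sqrt(C / (6*k)).
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Middle budget is roughly the Chinchilla training budget (~5.76e23 FLOPs).
    for c in (1e21, 5.76e23, 1e25):
        n, d = chinchilla_optimal(c)
        print(f"C={c:.2e} FLOPs -> N≈{n/1e9:.1f}B params, D≈{d/1e9:.0f}B tokens")
```

At roughly the Chinchilla training budget (~5.76e23 FLOPs), this reproduces the paper's choice of about 70B parameters trained on about 1.4T tokens.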
- Brain size corresponds to model size
- Learning period before adulthood corresponds to training data size
Extrapolating the Chinchilla scaling law to the human brain implies that a compute-optimal amount of learning would take on the order of millions of years.
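
A rough sanity check of where a "millions of years" figure can come from, assuming (purely for illustration) that the brain's ~1e14 synapses play the role of parameters and that waking experience arrives at ~10 "tokens" per second; none of these numbers come from the paper.

```python
# Rough illustration of the "millions of years" extrapolation.
# Assumptions (not from the paper): ~1e14 synapses treated as parameters,
# ~10 "tokens" of experience per waking second, ~16 waking hours per day.

SYNAPSES_AS_PARAMS = 1e14          # assumed parameter count of the brain
TOKENS_PER_PARAM = 20              # Chinchilla rule of thumb
TOKENS_PER_SECOND = 10             # assumed rate of experiential "tokens"
WAKING_SECONDS_PER_YEAR = 16 * 3600 * 365

optimal_tokens = TOKENS_PER_PARAM * SYNAPSES_AS_PARAMS          # ~2e15 tokens
tokens_per_year = TOKENS_PER_SECOND * WAKING_SECONDS_PER_YEAR   # ~2.1e8 tokens/year

years_needed = optimal_tokens / tokens_per_year
print(f"Compute-optimal 'training' would take ~{years_needed:.1e} years")  # ~1e7 years
```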
However, in nature, external constraints force humans and other organisms into a trade-off between resources spent on the brain and resources needed for survival. Because an organism must survive through the learning period, the expected future gain from learning is discounted roughly exponentially by the risk of dying before it pays off, so organisms cannot afford anything close to a compute-optimal training period.
In modern society, the expected return on time spent learning is higher (though the correlation between intelligence and income remains weak, even between top performers and average people), and the cost of learning is lower than under natural conditions, so the learning period keeps lengthening.
For robots, this selective pressure on intelligence does not apply, so learning incurs only its (roughly linear) computational cost.
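
A hypothetical toy model of that trade-off (not from the paper or any cited source): the organism's payoff from a learning period T is discounted by an exponential survival probability, while the machine only pays a cost linear in T; the functional forms and constants are purely illustrative.

```python
import math

# Hypothetical toy model (illustrative constants, not from any source):
# an organism's payoff from a learning period T is discounted by the chance
# of surviving that period, exp(-hazard * T); a machine only pays a cost
# that grows linearly with T (compute).

def payoff(t: float) -> float:
    """Gains from learning, growing with diminishing returns."""
    return math.log1p(t)

def organism_value(t: float, hazard: float = 0.05) -> float:
    return payoff(t) * math.exp(-hazard * t)   # exponential survival discount

def machine_value(t: float, cost_per_unit: float = 0.01) -> float:
    return payoff(t) - cost_per_unit * t       # only a linear compute cost

ts = [t / 10 for t in range(1, 2001)]          # candidate learning periods 0.1 .. 200
best_organism = max(ts, key=organism_value)    # peaks early (~8 in these units)
best_machine = max(ts, key=machine_value)      # peaks much later (~99)
print(f"Organism's best learning period: {best_organism:.1f}")
print(f"Machine's best learning period:  {best_machine:.1f}")
```

The exponential discount pushes the organism's optimum far earlier than the machine's, which is the note's point about why robots can afford much longer training.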
from DeepMind
Training Compute-Optimal Large Language Models (Chinchilla Scaling)
