20 tokens per parameter
Previous models were undertrained (roughly 5 tokens per parameter). The paper estimates the compute-optimal N_opt(C) and D_opt(C), where C represents the compute budget, N the number of model parameters, and D the number of training tokens.
Optimal amount of data a model of a given size should be trained on
Larger models require more data
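As a quick illustration of the rule of thumb above, here is a minimal sketch that sizes a model and its dataset for a given FLOP budget. It assumes the widely used approximation C ≈ 6·N·D for training FLOPs together with the ~20 tokens-per-parameter ratio; the example budgets (including the ≈5.76e23 FLOP "Chinchilla-scale" figure) are illustrative assumptions, not values computed from the paper's fitted constants.

```python
import math

TOKENS_PER_PARAM = 20       # Chinchilla rule of thumb: ~20 tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6   # common approximation: training FLOPs C ≈ 6 * N * D


def compute_optimal_sizes(flop_budget: float) -> tuple[float, float]:
    """Return (parameters N, training tokens D) for a given FLOP budget.

    Solves C = 6 * N * D with D = 20 * N, i.e. N = sqrt(C / 120).
    """
    n_params = math.sqrt(flop_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Last entry is roughly a Chinchilla-scale training budget (assumed ≈5.76e23 FLOPs).
    for budget in (1e21, 1e22, 5.76e23):
        n, d = compute_optimal_sizes(budget)
        print(f"C={budget:.2e} FLOPs -> N≈{n/1e9:.1f}B params, D≈{d/1e9:.0f}B tokens")
```

With the Chinchilla-scale budget this recovers roughly 70B parameters and ~1.4T tokens, consistent with the configuration described in the abstract quoted below.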
- Brain size corresponds to model size
- Learning period before adulthood corresponds to training data size
Extrapolating the Chinchilla scaling law to the human brain suggests the brain would only be data-optimal with something on the order of millions of years of learning
In nature, however, organisms face a trade-off between the resources devoted to the brain and the resources needed for survival. The expected future payoff from learning is discounted (roughly exponentially) by the chance of not surviving the learning period, so evolution cannot afford a sufficiently long training period.
In modern society the expected return on time spent learning is higher (even though the correlation between intelligence and income remains modest, from top performers down to the average), and learning carries far less survival cost than in nature, so the learning period keeps getting longer.
For robots, this selective pressure on intelligence does not apply, so learning incurs only its computational cost, which scales roughly linearly with the amount of training.
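A hypothetical back-of-envelope for the "millions of years" claim above: loosely treating synapses as parameters and applying the ~20 tokens-per-parameter ratio, the implied amount of "training data" far exceeds what a single lifetime can supply. Every constant below (synapse count, token intake rate, waking hours) is a rough assumption chosen for illustration, not a figure from the Chinchilla paper or the interview.

```python
# Back-of-envelope only; every constant here is an assumed, illustrative value.
SYNAPSES = 1e14                  # rough human synapse count, treated as "parameters"
TOKENS_PER_PARAM = 20            # Chinchilla-style data requirement
TOKENS_PER_SECOND = 10           # assumed useful "tokens" absorbed per waking second
WAKING_SECONDS_PER_YEAR = 16 * 3600 * 365  # ~16 waking hours per day

tokens_needed = SYNAPSES * TOKENS_PER_PARAM                     # ~2e15 tokens
tokens_per_year = TOKENS_PER_SECOND * WAKING_SECONDS_PER_YEAR   # ~2e8 tokens/year
years_needed = tokens_needed / tokens_per_year

print(f"Tokens needed: {tokens_needed:.1e}")
print(f"Tokens absorbed per year: {tokens_per_year:.1e}")
print(f"Years of learning to be 'data-optimal': {years_needed:.1e}")  # ~1e7 years
```

Under these assumptions the requirement comes out around ten million years of experience, which is the order of magnitude behind the "millions of years" remark.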
from DeepMind
An empirical analysis of compute-optimal large language model training
We investigate the optimal model and dataset size for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to 10 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the training dataset size should be scaled equally: for every doubling of model size the training dataset size should also be doubled. We test this hypothesis by training a more compute-optimal model, Chinchilla, using the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a large range of downstream evaluation tasks. As a highlight, Chinchilla reaches an average accuracy of 67.5% on the MMLU benchmark, over a 7% improvement over Gopher.
https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training
The meaning of compute, and is human intelligence special? - Carl Shulman
Ilya Sutskever has said that GPUs are the new Bitcoin, and Elon Musk recently said that GPUs are harder to get than drugs. Why is every company preparing ASIC chips, and why is everyone ordering GPUs and scaling up compute?
Can future model performance really reach the point where models can be deployed in the real world? Is there no stage that only humans can reach? If so, is human intelligence special?
If intelligence was powerful enough to make humans the rulers of the planet, why did other animals not evolve to become smarter? Why did humans not become even smarter?
Dwarkesh Patel covered all of this in a fascinating interview on his podcast The Lunar Society.
Carl Shulman is a researcher at Oxford's Future of Humanity Institute and an advisor to the Open Philanthropy Project, and he previously worked at the Machine Intelligence Research Institute (MIRI). With Nick Bostrom, he co-authored Propositions Concerning Digital Minds and Society, a paper on how highly advanced AI could be integrated into society.
An interview interesting enough to change our perspective. Shall we listen in?
https://www.youtube.com/watch?v=_kRg-ZP1vQc&t=6533s
https://www.youtube.com/@UCXl4i9dYBrFOabk0xGmbkRA
https://www.youtube.com/watch?v=nbai4z06Z4w

Neural scaling law
In machine learning, a neural scaling law is a scaling law relating parameters of a family of neural networks.[1][2]
https://en.wikipedia.org/wiki/Neural_scaling_law
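As a concrete instance of such a law, the Chinchilla paper fits a parametric loss of the form sketched below; combined with the C ≈ 6ND approximation for training compute (assumed here, not stated in this note), it shows why model size and data end up being scaled in roughly equal proportion, as stated in the abstract above. Fitted constants are omitted; only the functional form is used.

```latex
% Chinchilla-style parametric loss (functional form from the paper; fitted
% constants omitted). N = parameters, D = training tokens, C = compute budget.
\[
  L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
\]
% Minimising L subject to a fixed budget C \approx 6 N D gives
\[
  N_{\mathrm{opt}}(C) \;\propto\; C^{\frac{\beta}{\alpha+\beta}},
  \qquad
  D_{\mathrm{opt}}(C) \;\propto\; C^{\frac{\alpha}{\alpha+\beta}},
\]
% so with the two fitted exponents roughly comparable, both grow as about
% C^{1/2}: double the model, double the data.
```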
Training Compute-Optimal Large Language Models (Chinchilla Scaling)

Seonglae Cho
