GPT-2 Small (default, 12 layers): 117M parameters
Trained on ~40GB of WebText data scraped from outbound Reddit links
Outperforms existing task-specific SOTA models without any per-task fine-tuning
Its impact: a single well-trained LLM might be able to handle every task


Small (default)
- 117M parameters (commonly counted as ~124M; see the sketch below)
- 12 layers
- 12 attention heads per layer
- 768 hidden dim
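
A rough parameter count from these hyperparameters shows where the 117M/124M figure comes from. This is a minimal sketch; the vocab size (50257), context length (1024), tied input/output embeddings, and 4x MLP expansion are assumed standard GPT-2 values, not stated above:

```python
# Estimate GPT-2 Small's parameter count from its hyperparameters.
n_layer, n_head, d_model = 12, 12, 768
n_vocab, n_ctx = 50257, 1024      # assumed standard GPT-2 values

# Embeddings: token + positional (the LM head is tied to the token embedding)
embed = n_vocab * d_model + n_ctx * d_model

# Per block: attention QKV + output projections (weights + biases)
attn = 4 * d_model * d_model + 4 * d_model
# MLP: d -> 4d -> d (weights + biases)
mlp = 8 * d_model * d_model + 4 * d_model + d_model
# Two LayerNorms per block (scale + bias each)
ln = 2 * 2 * d_model

total = embed + n_layer * (attn + mlp + ln) + 2 * d_model  # + final LayerNorm
print(f"{total:,}")  # 124,439,808 -> the "124M" in llm.c; the paper's table reports 117M
```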
Language Models are Unsupervised Multitask Learners
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 · karpathy llm.c · Discussion #481
https://github.com/karpathy/llm.c/discussions/481
Algpt2 Part 2
https://bkkaggle.github.io/blog/algpt2/2020/07/17/ALGPT2-part-2.html#replicating-gpt-2

Seonglae Cho