GPT-2

- Small (default): 117M parameters, 12 layers, 12 attention heads per layer, 768 hidden dim
- Training data: ~40GB of WebText (text scraped from outbound Reddit links)
- Surpasses the existing task-specific SOTA models without any fine-tuning per task
- Impact: suggests that a single well-trained LLM might be able to handle every task

Links:
- Language Models are Unsupervised Multitask Learners (GPT-2 paper): https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 (karpathy/llm.c, Discussion #481): https://github.com/karpathy/llm.c/discussions/481
- ALGPT-2 Part 2: How I (almost) replicated OpenAI's GPT-2 (124M): https://bkkaggle.github.io/blog/algpt2/2020/07/17/ALGPT2-part-2.html#replicating-gpt-2
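The layer/head/hidden-dim numbers above fully determine the model size. A minimal sketch of that arithmetic, assuming the usual GPT-2 choices (50,257-token vocabulary, 1,024-token context, tied input/output embeddings, biases on every linear layer), none of which are stated in this note itself:

```python
# Rough parameter count for GPT-2 Small from the architecture numbers above.
# Assumed (not stated in the note): 50,257-token vocab, 1,024-token context,
# tied input/output embeddings, biases on every linear layer.
n_layer, n_head, d_model = 12, 12, 768
vocab_size, n_ctx = 50257, 1024

embeddings = vocab_size * d_model + n_ctx * d_model     # token + positional embeddings

per_block = (
    4 * d_model * d_model + 4 * d_model                 # attention: QKV + output projection (+ biases)
    + 8 * d_model * d_model + 5 * d_model               # MLP: 768 -> 3072 -> 768 (+ biases)
    + 4 * d_model                                       # two LayerNorms (scale + bias each)
)

total = embeddings + n_layer * per_block + 2 * d_model  # plus the final LayerNorm
print(f"{total / 1e6:.1f}M parameters")                 # ~124.4M
```

This comes out to roughly 124M, which is why the llm.c discussion linked above calls the same checkpoint "GPT-2 (124M)", while the paper's table reports 117M for the Small model.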