OpenAI GPT (Radford et al., 2018; released in June 2018)
12 Transformer Decoder layers
Suggests that the model's size affects the performance of the pretrained model
Conv1D is used instead of an FFNN, which is known to work well for large-scale training (see the sketch below).
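A minimal sketch of a GPT-style Conv1D, which behaves like a linear layer with the weight stored as (in_features, out_features), as in the original GPT code and the Hugging Face port. The hyperparameters (d_model=768, 12 layers) match GPT-1, but the module and variable names here are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class Conv1D(nn.Module):
    """GPT-style Conv1D: effectively an affine map y = x @ W + b,
    with W laid out as (in_features, out_features)."""
    def __init__(self, out_features: int, in_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_features, out_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Flatten leading dims, apply the affine map, then restore the shape
        size_out = x.size()[:-1] + (self.weight.size(1),)
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
        return x.view(size_out)

# Illustrative GPT-1-like settings: 12 decoder layers, hidden size 768
d_model, n_layers = 768, 12
# Position-wise MLP of one decoder block, built from Conv1D instead of nn.Linear
mlp = nn.Sequential(Conv1D(4 * d_model, d_model), nn.GELU(), Conv1D(d_model, 4 * d_model))
x = torch.randn(2, 10, d_model)   # (batch, sequence, hidden)
print(mlp(x).shape)               # torch.Size([2, 10, 768])
```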
Improving language understanding with unsupervised learning
We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers and unsupervised pre-training. These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.
https://openai.com/index/language-unsupervised/

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

Seonglae Cho