Overview
Bidirectional language models are called "bi-directional" because their representations effectively combine two unidirectional representations, one reading the sequence forward and one backward. However, this approach may impose excessive restrictions on Emergent ability: the computation includes predicting previous words by connecting forward and backward contexts, yet humans do not naturally read text backwards.
Encoder Models (Auto-Encoding Architecture)
Encoder models use the Transformer Encoder block. As Andrej Karpathy notes, "All it means that it is an encoder block is that you will delete this diagonal line of code" — that is, an encoder block is a decoder block with the causal (triangular) attention mask removed, so every token can attend to every other token (see the sketch after the list below). These models are characterized by bi-directional attention, which allows them to:
- Transform text or images into condensed numerical representations called embeddings
- Encode input sentences into vectors while preserving their semantic meaning in a form that's easier for the model to process
- Map input sequences to Latent space through auto-encoding
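To make the quote above concrete, here is a minimal PyTorch sketch of a single self-attention head. It is an illustrative simplification (no learned projections, no multi-head split), not any particular library's implementation; the only difference between the decoder-style (causal) and encoder-style (bi-directional) variants is whether the triangular mask line is applied.

```python
import torch
import torch.nn.functional as F

def self_attention(x, causal=False):
    """Single-head self-attention over x of shape (batch, seq_len, dim).

    causal=True  -> decoder-style: each position attends only to itself and earlier positions.
    causal=False -> encoder-style (bi-directional): every position attends to every other
                    position; the triangular mask is simply not applied, which is the
                    "line of code" an encoder block deletes.
    """
    B, T, C = x.shape
    # For brevity, queries/keys/values are the input itself; a real block
    # would use learned linear projections.
    q, k, v = x, x, x
    scores = q @ k.transpose(-2, -1) / C**0.5            # (B, T, T) similarity scores
    if causal:
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # higher similarity -> larger weight
    return weights @ v                                   # weighted sum of value vectors

x = torch.randn(1, 5, 8)
encoder_out = self_attention(x, causal=False)  # bi-directional: full context
decoder_out = self_attention(x, causal=True)   # autoregressive: left context only
```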
Key Characteristics
- Bi-directional Attention: The model encodes every token in the input sequence, using Self-Attention to compute a similarity score between each position and all other positions in the sequence, both before and after it. Positions with higher similarity receive greater weight when forming each token's embedding vector, enabling comprehensive context understanding (see the sketch after this list).
- Pretraining Objective: These models are typically pretrained by corrupting input sentences and tasking the model with reconstructing the original text.
- Scaling Limitations: The bi-directional approach imposes stricter constraints on scaling than the Causal language model approach.
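As a rough illustration of bi-directional attention in practice, the sketch below loads a BERT encoder with the Hugging Face Transformers library (bert-base-uncased is used only as a representative checkpoint) and inspects both the contextual token embeddings and the attention weights; each attention row spans the full sequence rather than being truncated by a causal mask.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative encoder checkpoint; any BERT-style model behaves the same way here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token, each computed by attending over the
# whole sentence -- left and right context alike.
token_embeddings = outputs.last_hidden_state       # (1, seq_len, hidden_size)

# Attention weights from the last layer: (1, num_heads, seq_len, seq_len).
# Every row sums to 1 over *all* positions -- no causal (lower-triangular) mask.
last_layer_attn = outputs.attentions[-1]
print(last_layer_attn[0, 0].sum(dim=-1))           # each entry ~1.0: full-context attention
```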
Technical Components
Token Type Embedding
In BERT models, when two sentences are provided as input, tokens from the first sentence are assigned a value of 0, while tokens from the second sentence are assigned a value of 1 to distinguish between the two segments.
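For example, with a BERT-style tokenizer from Hugging Face Transformers (bert-base-uncased is just an illustrative checkpoint), passing two sentences yields token_type_ids of 0 for the first segment and 1 for the second; a minimal sketch:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any BERT-style tokenizer exposes token_type_ids.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("How old are you?", "I am six years old.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]', 'i', 'am', 'six', 'years', 'old', '.', '[SEP]']
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   <- first segment = 0, second segment = 1
```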
Masked Language Modeling (MLM)
This approach is called "Masked Language Modeling" because the model predicts masked tokens within the sequence (distinct from the causal masking used in autoregressive models).
MLM Strategy:
- Uses the 80-10-10 corruption strategy: of the tokens selected for prediction, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged (see the sketch below)
- MLM takes slightly longer to converge than causal language modeling because the loss is computed on only about 15% of the tokens in each sequence
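Below is a rough sketch of this corruption scheme, not BERT's exact implementation: the function name mlm_corrupt and the way special tokens are excluded are illustrative choices. About 15% of positions are selected as prediction targets, and the 80-10-10 rule then decides whether each selected token is replaced with [MASK], swapped for a random token, or left unchanged.

```python
import torch
from transformers import AutoTokenizer

# Illustrative checkpoint; only the tokenizer's vocabulary and [MASK] id are needed.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def mlm_corrupt(input_ids, mlm_prob=0.15):
    """Apply BERT-style MLM corruption to a 1-D tensor of token ids (sketch)."""
    labels = input_ids.clone()
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(input_ids.tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    )
    # Select ~15% of non-special positions as prediction targets.
    selected = (torch.rand(input_ids.shape) < mlm_prob) & ~special
    labels[~selected] = -100                              # conventional ignore index: only selected positions contribute to the loss

    corrupted = input_ids.clone()
    roll = torch.rand(input_ids.shape)
    mask_pos = selected & (roll < 0.8)                    # 80%: replace with [MASK]
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)    # 10%: replace with a random token
    # Remaining 10%: keep the original token unchanged.
    corrupted[mask_pos] = tokenizer.mask_token_id
    corrupted[rand_pos] = torch.randint(len(tokenizer), (int(rand_pos.sum()),))
    return corrupted, labels

ids = torch.tensor(tokenizer("Encoder models reconstruct corrupted input.")["input_ids"])
corrupted, labels = mlm_corrupt(ids)
print(tokenizer.decode(corrupted))
```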

