Generation with fixed (pretrained) weights
- transform the input string into token index IDs
- make input embeddings by adding the token embedding and the positional embedding
- a dot product would compute the correlation between the two vectors, so addition is used instead, so that each token's meaning and its position information are passed to the model independently and clearly
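The two steps above (look up token embeddings, add positional embeddings) can be sketched in NumPy; the sizes and weight tables here are made-up placeholders for illustration, not real model weights:

```python
import numpy as np

# Hypothetical sizes for illustration only.
vocab_size, max_len, d_model = 100, 16, 8
rng = np.random.default_rng(0)

# Fixed (pretrained, frozen) weight tables.
token_emb = rng.normal(size=(vocab_size, d_model))  # one row per token id
pos_emb = rng.normal(size=(max_len, d_model))       # one row per position

def embed(token_ids):
    """Input embedding = token embedding + positional embedding (element-wise add)."""
    token_ids = np.asarray(token_ids)
    return token_emb[token_ids] + pos_emb[np.arange(len(token_ids))]

x = embed([3, 14, 15, 9])  # e.g. ids produced by a tokenizer
print(x.shape)             # (4, 8): one d_model vector per token
```

Note the element-wise `+` rather than a dot product: each token keeps a full d_model vector that carries both kinds of information.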
Transformer Block
- layer normalization of each input embedding before the QKV weight matrices are applied
- split the vector dimension into as many slices as there are heads (these projection weights are learned as well)
- Self-Attention
- make Q, K, V vectors from the Q, K, V weight matrices
- scale the attention scores by √d_k, which acts like a temperature, so the softmax does not become overly skewed as the dimension grows
- output projection before the residual connection (with multi-head attention the result is a concatenation of head slices rather than a single vector, so it passes through the output-projection linear transform) QK/VO
- Residual Connection
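The self-attention steps above (QKV projection, head split, scaled dot-product scores, head concatenation, output projection, residual add) can be sketched as follows; the dimensions and random weights are illustrative assumptions, and a causal mask is included since this is a generation pipeline:

```python
import numpy as np

# Hypothetical sizes; real models use far larger dimensions.
d_model, n_heads = 8, 2
d_head = d_model // n_heads
rng = np.random.default_rng(1)

# Q, K, V projection weights and the output projection (learned; random here).
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    T = x.shape[0]
    # Project, then split the feature dimension into n_heads slices.
    q = (x @ W_q).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ W_k).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ W_v).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product scores: divide by sqrt(d_head) so the softmax stays well-behaved.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Causal mask: each position may attend only to itself and earlier positions.
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    out = softmax(scores) @ v                         # (n_heads, T, d_head)
    out = out.transpose(1, 0, 2).reshape(T, d_model)  # re-concatenate the head slices
    # The output projection mixes the head slices back into one vector,
    # then the residual connection adds the block input.
    return x + out @ W_o

y = self_attention(rng.normal(size=(4, d_model)))
print(y.shape)  # (4, 8)
```

The output projection is needed precisely because the concatenated result is a stack of per-head slices, not a single coherent vector, before the residual add.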
- FFNN / multi-layer perceptron (weights are sometimes shared to reduce the parameter count)
- layer normalization with a learned affine (scale and shift) linear transformation
- NN (usually up projection → activation → down projection)
- Activation Function like GELU
- MLP down-projection back to the model dimension so the residual connection can be applied
- MLP residual
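The MLP sub-block above (layer norm with learned scale/shift, up projection, GELU, down projection, residual) can be sketched like this; all sizes and weights are illustrative assumptions, with the common 4x hidden-width convention:

```python
import numpy as np

# Hypothetical sizes: the hidden layer is typically ~4x the model dimension.
d_model, d_hidden = 8, 32
rng = np.random.default_rng(2)
W_up, b_up = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W_down, b_down = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)  # learned LN scale and shift

def layer_norm(x, eps=1e-5):
    """Normalize each vector, then apply the learned affine transform."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x):
    h = layer_norm(x)          # pre-norm with learned affine transform
    h = gelu(h @ W_up + b_up)  # up projection -> activation
    h = h @ W_down + b_down    # down projection back to d_model
    return x + h               # MLP residual

y = mlp_block(rng.normal(size=(4, d_model)))
print(y.shape)  # (4, 8)
```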
- final layer normalization with a learned affine (scale and shift) linear transformation
- make logits from the LM head matrix via a linear projection
- softmax function to determine the probabilities
- Its goal is to take a vector and normalize its values so that they sum to 1.0.
- sampling from the distribution (or a search-based decoding strategy such as beam search)
- A higher temperature will make the distribution more uniform, and a lower temperature will make it more concentrated on the highest probability tokens
- We do this by dividing the logits (the output of the linear transformation) by the temperature before applying the softmax.
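Dividing the logits by the temperature before the softmax, then sampling, can be sketched as follows (the logit values are made up for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()  # values are non-negative and sum to 1.0

logits = np.array([2.0, 1.0, 0.1])
p_cold = softmax_with_temperature(logits, temperature=0.5)  # more concentrated on the top token
p_hot = softmax_with_temperature(logits, temperature=2.0)   # closer to uniform

# Sample the next token id from the resulting distribution.
rng = np.random.default_rng(0)
next_id = rng.choice(len(logits), p=softmax_with_temperature(logits))
```

A temperature below 1 sharpens the distribution toward the highest-probability token; a temperature above 1 flattens it.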
Transformer Block
The earlier layers tend to focus on learning lower-level features and patterns, while the later layers learn to recognize and understand higher-level abstractions and relationships.
In the context of natural language processing, the lower layers might learn grammar, syntax, and simple word associations, while the higher layers might capture more complex semantic relationships, discourse structures, and context-dependent meaning.