Generation with fixed weights
- Transform the input string into token index IDs (tokenization)
- Build the input embeddings from the token embeddings and the positional embeddings
- Addition (rather than a dot product) is used so that token meaning and position information are each conveyed to the model clearly and independently (sketched below)
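A minimal sketch of this input stage, assuming a GPT-2-style learned positional embedding table rather than a fixed sinusoidal one; the sizes and token IDs below are made-up placeholders, not values from any particular model:

```python
import numpy as np

# Tiny illustrative sizes (placeholders, not the dimensions of a real model)
vocab_size, max_seq_len, d_model = 1000, 128, 64

rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(vocab_size, d_model))      # learned token embedding table
position_embedding = rng.normal(size=(max_seq_len, d_model))  # learned positional embedding table

def embed(token_ids):
    """Look up token embeddings and add positional embeddings element-wise."""
    positions = np.arange(len(token_ids))
    # Addition (not a dot product) lets token meaning and position contribute
    # independently to the same input vector.
    return token_embedding[token_ids] + position_embedding[positions]

# token_ids would come from a tokenizer applied to the input string;
# these IDs are made up for illustration.
token_ids = np.array([42, 7, 314])
x = embed(token_ids)   # shape: (seq_len, d_model)
```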
Transformer Block
- Layer normalization is applied to each input embedding before it is multiplied by the QKV weight matrices (pre-norm)
- Split the vector dimensions across the number of attention heads (the projection weights here are learned during training)
- Self-Attention
- Compute the Q, K, V vectors by multiplying the input by the Q, K, V weight matrices
- Scale the QKᵀ scores (a seq_len × seq_len matrix) by √d_k, a temperature-like factor, so the softmax does not become overly skewed; the resulting attention weights are applied to V
- Output projection before the residual connection: with multi-head attention the result is a concatenation of per-head segments rather than a single vector, so a linear output projection is needed to mix them back together (the weights pair up as QK / VO)
- Residual connection: add the attention output back to the block input (a sketch of this attention path follows)
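A minimal NumPy sketch of the attention path described above, assuming a pre-norm, causal (decoder-style) block; the sizes match GPT-2 small (d_model = 768, 12 heads) for illustration, and the weight matrices are random placeholders standing in for learned parameters:

```python
import numpy as np

d_model, n_heads = 768, 12
d_head = d_model // n_heads
rng = np.random.default_rng(1)

# Random placeholders standing in for the learned Q, K, V and output projection weights
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
ln_gain, ln_bias = np.ones(d_model), np.zeros(d_model)

def layer_norm(x, gain, bias, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(x):
    """Pre-norm multi-head causal self-attention followed by the residual connection."""
    seq_len = x.shape[0]
    h = layer_norm(x, ln_gain, ln_bias)              # LN before the QKV projections
    q, k, v = h @ W_q, h @ W_k, h @ W_v              # (seq_len, d_model)

    def split(t):  # split d_model into n_heads segments of size d_head
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)           # (n_heads, seq_len, d_head)

    # Scaled scores: a (seq_len x seq_len) matrix per head, divided by sqrt(d_head)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    mask = np.triu(np.full((seq_len, seq_len), -1e10), k=1)  # causal mask: no attending to the future
    weights = softmax(scores + mask)                 # attention weights
    out = weights @ v                                # weighted sum of the V vectors

    # Concatenate the head segments and mix them with the output projection
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return x + out @ W_o                             # residual connection

x = rng.normal(size=(5, d_model))   # e.g. embeddings for 5 tokens
y = attention_block(x)              # shape: (5, d_model)
```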
- Feed-forward network, a multi-layer perceptron (in some cases the weights are shared to reduce the parameter count)
- Layer normalization with a learned scale and bias (the linear transformation)
- NN (usually up projection → activation → down projection)
- Activation Function like GELU
- The down projection returns the vector to d_model so it can be added back through the residual connection
- MLP residual connection (a sketch of this feed-forward path follows)
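A minimal sketch of the feed-forward path under the same assumptions (pre-norm, random placeholder weights, GELU activation, and a 4x expansion of the hidden size, which is a common but not universal choice):

```python
import numpy as np

d_model, d_ff = 768, 4 * 768   # 4x expansion of the hidden size (illustrative)
rng = np.random.default_rng(2)

# Random placeholders standing in for the learned up/down projections and LN parameters
W_up, b_up = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W_down, b_down = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
ln_gain, ln_bias = np.ones(d_model), np.zeros(d_model)

def layer_norm(x, gain, bias, eps=1e-5):   # same helper as in the attention sketch
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias

def gelu(x):
    # tanh approximation of GELU (the form used in GPT-2)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x):
    """Pre-norm feed-forward block: LN -> up projection -> GELU -> down projection -> residual."""
    h = layer_norm(x, ln_gain, ln_bias)
    h = gelu(h @ W_up + b_up)    # up projection + activation
    h = h @ W_down + b_down      # down projection back to d_model
    return x + h                 # residual connection

y = mlp_block(rng.normal(size=(5, d_model)))   # shape: (5, d_model)
```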
- Final layer normalization, again with a learned scale and bias
- Compute logits through a linear projection with the LM head matrix (mapping from d_model back to the vocabulary size)
- Apply the softmax function to the logits to determine the probabilities
- Its goal is to take a vector and normalize its values so that they sum to 1.0.
- Pick the next token from the distribution (by sampling, or with search strategies such as beam search)
- A higher temperature will make the distribution more uniform, and a lower temperature will make it more concentrated on the highest probability tokens
- We do this by dividing the logits (the output of the linear transformation) by the temperature before applying the softmax (see the sketch below)
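A minimal sketch of this output stage, assuming final hidden states from the blocks above; the LM head matrix is a random placeholder (in many models it is tied to the token embedding) and the temperature value is arbitrary:

```python
import numpy as np

d_model, vocab_size = 768, 50257   # GPT-2-like sizes, used here only for illustration
rng = np.random.default_rng(3)

# Random placeholder standing in for the learned LM head matrix
W_lm_head = rng.normal(size=(d_model, vocab_size)) * 0.02
ln_gain, ln_bias = np.ones(d_model), np.zeros(d_model)

def layer_norm(x, gain, bias, eps=1e-5):   # same helper as in the attention sketch
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def next_token(hidden_states, temperature=0.8):
    """Final LN -> LM head logits -> temperature-scaled softmax -> sample the next token."""
    h = layer_norm(hidden_states[-1], ln_gain, ln_bias)  # the last position predicts the next token
    logits = h @ W_lm_head                               # (vocab_size,)
    probs = softmax(logits / temperature)                # divide logits by temperature before softmax
    return rng.choice(vocab_size, p=probs)               # sample a token id from the distribution

hidden_states = rng.normal(size=(5, d_model))   # stand-in for the transformer's output
print(next_token(hidden_states))                # higher temperature -> flatter distribution
```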
Transformer Blocks
The earlier layers tend to focus on learning lower-level features and patterns, while the later layers learn to recognize and understand higher-level abstractions and relationships.
In the context of natural language processing, the lower layers might learn grammar, syntax, and simple word associations, while the higher layers might capture more complex semantic relationships, discourse structures, and context-dependent meaning.