Transformer Inference

Creator: Seonglae Cho
Created: 2024 Feb 21 9:46
Edited: 2025 Jun 16 23:17

Generation with fixed weights

  1. Tokenize the input string into token ids (vocabulary indices)
  1. Build the input embeddings by adding the token embedding and the
    Positional Embedding
    1. Addition is used rather than a dot product so that token meaning and position information are both passed to the model as separate, clearly recoverable signals
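
The two steps above can be sketched roughly as follows. This is a minimal illustration, not any particular model's code: the toy `vocab`, the whitespace tokenizer, the sizes, and the random `token_embedding` / `positional_embedding` tables are all assumptions standing in for a trained tokenizer and trained weights, and a learned positional table is only one of several positional schemes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; real models use a learned subword tokenizer.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model, max_len = 8, 16

# Random stand-ins for trained, frozen embedding tables.
token_embedding = rng.normal(size=(len(vocab), d_model))
positional_embedding = rng.normal(size=(max_len, d_model))

def embed(text):
    """Map a whitespace-split string to (token ids, input embeddings)."""
    ids = np.array([vocab[word] for word in text.split()])
    positions = np.arange(len(ids))
    # Element-wise addition: the result keeps shape (seq_len, d_model).
    return ids, token_embedding[ids] + positional_embedding[positions]

ids, x = embed("the cat sat")
print(ids, x.shape)
```

Because the two tables are added element-wise, the same token at a different position produces a different input vector while the width stays at `d_model`.
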
Transformer Block
  1. Layer Normalization
    of each input embedding, before it is multiplied by the QKV weight matrices
  1. Split the vector dimension across the attention heads (the per-head projection weights are learned during training and stay fixed at inference)
  1. Self-Attention
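    1. Compute the Q, K, V vectors from the Q, K, V weight matrices
    2. Scale the $QK^\top$ scores by $1/\sqrt{d_k}$, which acts like a temperature: dot products grow in magnitude with $d_k$, and unscaled scores would skew the softmax toward a near one-hot distribution. The softmax of the scaled scores then weights the V vectors
    3. Output projection before the
      Residual Connection
      (with multi-head attention the result is a concatenation of per-head segments rather than a single vector, so a linear output projection is needed to mix them) QK/VO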
    4. Residual Connection
  1. FFNN
    Multi Layer Perceptron
    (weights are sometimes shared to reduce the parameter count)
    1. Layer Normalization
      with its learned element-wise scale and shift (a linear transformation)
    2. NN (usually up projection → activation → down projection)
    3. Activation Function
      like
      GELU
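    4. MLP down projection back to the model dimension, so the output can be added through the
      Residual Connection
    5. MLP residual addition
  1. Final Layer Normalization with its learned element-wise scale and shift (a linear transformation)
  1. Compute the
    Logits
    by a linear projection with the LM Head matrix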
  1. Softmax Function
    to turn the logits into probabilities
    1. It takes a vector and normalizes its values so that they are non-negative and sum to 1.0.
  1. Choose the next token by sampling from the distribution (or with a search-based decoder such as
    Beam search
    )
    1. A higher temperature will make the distribution more uniform, and a lower temperature will make it more concentrated on the highest probability tokens
    2. We do this by dividing the logits (the output of the linear transformation) by the temperature before applying the softmax.
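
The block and sampling steps above can be put together in a rough single-step sketch. Everything named here is an assumption for illustration: the sizes, the random matrices standing in for trained weights, the pre-LayerNorm ordering, the causal mask (typical for decoder-only generation but not stated above), the tanh GELU approximation, and a LayerNorm without its learned scale and shift.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_ff, vocab_size = 8, 2, 32, 10
d_head = d_model // n_heads

def init(*shape):
    # Random stand-in for trained, frozen weights.
    return rng.normal(scale=0.1, size=shape)

W_q, W_k, W_v, W_o = (init(d_model, d_model) for _ in range(4))
W_up, W_down = init(d_model, d_ff), init(d_ff, d_model)
W_lm_head = init(d_model, vocab_size)

def layer_norm(x, eps=1e-5):
    # Learned scale and shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def split_heads(x):
    # (seq, d_model) -> (n_heads, seq, d_head)
    return x.reshape(x.shape[0], n_heads, d_head).transpose(1, 0, 2)

def block(x):
    seq = x.shape[0]
    h = layer_norm(x)
    q, k, v = (split_heads(h @ W) for W in (W_q, W_k, W_v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Causal mask (assumed): each position attends only to itself and earlier ones.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    heads = softmax(scores) @ v                       # (n_heads, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq, d_model)
    x = x + merged @ W_o                              # output projection + residual
    x = x + gelu(layer_norm(x) @ W_up) @ W_down       # MLP + residual
    return x

def next_token_probs(x, temperature=1.0):
    # Final layer norm, LM head on the last position, temperature, softmax.
    logits = layer_norm(block(x))[-1] @ W_lm_head
    return softmax(logits / temperature)

x = rng.normal(size=(3, d_model))       # stand-in for three input embeddings
probs = next_token_probs(x, temperature=0.7)
next_id = rng.choice(vocab_size, p=probs)
```

Calling `next_token_probs` with a smaller `temperature` concentrates the probability mass on the top logits, while a larger value flattens it; a real model would stack many such blocks and append `next_id` to the input before the next step.
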

transformer block

The earlier layers tend to focus on learning lower-level features and patterns, while the later layers learn to recognize and understand higher-level abstractions and relationships.
In the context of natural language processing, the lower layers might learn grammar, syntax, and simple word associations, while the higher layers might capture more complex semantic relationships, discourse structures, and context-dependent meaning.

RNN inference idea

