Transformer Inference

Creator: Seonglae Cho
Created: 2024 Feb 21 9:46
Edited: 2024 Dec 13 16:18

Generate with fixed weights

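  1. Transform the input string into token index IDs
  1. Make input embeddings by adding the token embedding and the
    Positional Embedding
    1. A dot product would compute the correlation between the two vectors; addition is used instead so that each token's meaning and its position information are passed to the model independently and clearly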
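The two steps above can be sketched in plain Python as follows. This is a minimal illustration, not a real tokenizer or a trained embedding: the vocabulary `VOCAB`, the table `TOKEN_EMBEDDING`, and the helper names are hypothetical, and the sinusoidal positional embedding is just one common choice (many models learn it instead).

```python
import math

# Hypothetical toy vocabulary; real models use a learned subword tokenizer.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
D_MODEL = 4

def tokenize(text):
    """Map a whitespace-split string to token index IDs."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]

# Stand-in for a learned token embedding table: one row of D_MODEL floats per token ID.
TOKEN_EMBEDDING = [[0.1 * (token_id + 1)] * D_MODEL for token_id in range(len(VOCAB))]

def positional_embedding(position, d_model=D_MODEL):
    """Sinusoidal positional embedding (an assumption; many models learn this table)."""
    row = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return row

def embed(token_ids):
    """Input embedding = token embedding + positional embedding (element-wise sum)."""
    return [
        [tok + pos for tok, pos in zip(TOKEN_EMBEDDING[token_id], positional_embedding(position))]
        for position, token_id in enumerate(token_ids)
    ]

ids = tokenize("The cat sat")
x = embed(ids)  # one D_MODEL-sized vector per token
```

Because the two embeddings are summed rather than multiplied, the same token at two different positions yields two different input vectors while the token table itself stays unchanged.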
Transformer Block
  1. Layer Normalization
    for each input embedding, before the QKV weight matrices are applied
  1. Split the vector dimension by the number of heads (these weights are learned as well)
  1. Self-Attention
    1. Make the Q, K, V vectors from the Q, K, V weight matrices
    2. Temperature scaling of the attention scores, typically by sqrt(d_k), for skewness (because it is a square matrix)
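    3. Output projection because of the
      Residual Connection
      (with multi-head attention the result is not a single vector but a set of per-head segments, so it passes through the output projection, a linear transformation) QK/VO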
    4. Residual Connection
  1. FFNN
    Multi Layer Perceptron
    (sometimes shared, to use fewer weights)
    1. Layer Normalization
      with linear transformation
    2. NN (usually up projection → activation → down projection)
    3. Activation Function
      such as
      GELU
    4. MLP projection back to the model dimension because of the
      Residual Connection
    5. MLP residual
  1. Layer Normalization with linear transformation
  1. Make
    Logits
    from the LM Head matrix through a linear projection
  1. SoftMax Function
    to determine the probabilities
    1. Its goal is to take a vector and normalize its values so that they sum to 1.0.
  1. Sample from the distribution (or decode with
    Beam search
    )
    1. A higher temperature makes the distribution more uniform, and a lower temperature concentrates it on the highest-probability tokens.
    2. We do this by dividing the logits (the output of the linear transformation) by the temperature before applying the softmax.
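The last steps of the outline (logits, temperature, softmax, sampling) can be sketched as below. This is a minimal standard-library illustration with hypothetical helper names `softmax` and `sample_token`; it draws a single token by plain sampling and does not implement beam search.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize so the values sum to 1.0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token ID from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                # example LM head output for a 3-token vocabulary
cold = softmax(logits, temperature=0.5)  # concentrated on the top token
hot = softmax(logits, temperature=2.0)   # closer to uniform
token_id = sample_token(logits, temperature=0.8, rng=random.Random(0))
```

Comparing `cold` and `hot` shows the effect described above: the lower temperature puts more probability on the highest logit, while the higher temperature flattens the distribution.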

Transformer block

The earlier layers tend to focus on lower-level features and patterns, while the later layers learn to recognize higher-level abstractions and relationships.
In natural language processing, the lower layers might learn grammar, syntax, and simple word associations, while the higher layers might capture more complex semantic relationships, discourse structures, and context-dependent meaning.
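The per-block computation from the outline at the top of this note (Layer Normalization, multi-head Self-Attention with output projection, residual, then MLP with GELU and residual) can be sketched in plain Python as follows. Several things here are assumptions rather than anything stated in the note: the pre-LayerNorm ordering, the causal mask (as in decoder-only language models), the omission of biases and of the learned LayerNorm scale and shift, and all names, sizes, and random weights.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices (used for the residual connections)."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (learned scale/shift omitted)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def gelu(v):
    """tanh approximation of GELU."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Causal multi-head self-attention followed by the output projection."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0]) // n_heads
    heads_out = [[] for _ in x]  # concatenated per-head outputs, one row per token
    for h in range(n_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        for i, q_row in enumerate(q):
            # Scaled dot-product scores against positions 0..i only (causal mask).
            scores = [
                sum(a * b for a, b in zip(q_row[lo:hi], k[j][lo:hi])) / math.sqrt(d_head)
                for j in range(i + 1)
            ]
            weights = softmax(scores)
            heads_out[i].extend(
                sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(lo, hi)
            )
    return matmul(heads_out, w_o)  # mixes the per-head segments back into one vector

def transformer_block(x, p, n_heads):
    """Pre-LayerNorm block: attention + residual, then MLP + residual."""
    x = add(x, attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"], p["w_o"], n_heads))
    hidden = [[gelu(v) for v in row] for row in matmul(layer_norm(x), p["w_up"])]
    return add(x, matmul(hidden, p["w_down"]))

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff, n_heads, seq_len = 8, 32, 2, 3
params = {name: random_matrix(d_model, d_model, rng) for name in ("w_q", "w_k", "w_v", "w_o")}
params["w_up"] = random_matrix(d_model, d_ff, rng)
params["w_down"] = random_matrix(d_ff, d_model, rng)
x = random_matrix(seq_len, d_model, rng)
y = transformer_block(x, params, n_heads)  # same shape as x: seq_len rows of d_model values
```

Since the output has the same shape as the input, blocks like this can be stacked, which is what lets earlier and later layers specialize as described above. Real implementations use a tensor library instead of nested lists.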
 
 
 

RNN inference idea

 
 
