Generation with fixed (pretrained) weights
- transform the input string into token index IDs
- make input embeddings by adding the token embedding and the positional embedding
- a dot product would compute the correlation between the two vectors, so addition is used instead, so that each token's meaning and its position information are passed to the model independently and clearly
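The two steps above (look up token embeddings, add positional embeddings) can be sketched in NumPy; the sizes and weight tables here are made-up placeholders for illustration, not real model weights:

```python
import numpy as np

# Hypothetical sizes for illustration only.
vocab_size, max_len, d_model = 100, 16, 8
rng = np.random.default_rng(0)

# Fixed (pretrained, frozen) weight tables.
token_emb = rng.normal(size=(vocab_size, d_model))  # one row per token id
pos_emb = rng.normal(size=(max_len, d_model))       # one row per position

def embed(token_ids):
    """Input embedding = token embedding + positional embedding (element-wise add)."""
    token_ids = np.asarray(token_ids)
    return token_emb[token_ids] + pos_emb[np.arange(len(token_ids))]

x = embed([3, 14, 15, 9])  # e.g. ids produced by a tokenizer
print(x.shape)             # (4, 8): one d_model vector per token
```

Note the element-wise `+` rather than a dot product: each token keeps a full d_model vector that carries both kinds of information.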
Transformer Block
- layer normalization of each input embedding before the QKV weight matrices are applied
- split the vector dimension into as many slices as there are heads (these projection weights are learned as well)
- Self-Attention
- make Q, K, V vectors from the Q, K, V weight matrices
- scale the attention scores by √d_k, which acts like a temperature, so the softmax does not become overly skewed as the dimension grows
- output projection before the residual connection (with multi-head attention the result is a concatenation of head slices rather than a single vector, so it passes through the output-projection linear transform) QK/VO
- Residual Connection
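The self-attention steps above (QKV projection, head split, scaled dot-product scores, head concatenation, output projection, residual add) can be sketched as follows; the dimensions and random weights are illustrative assumptions, and a causal mask is included since this is a generation pipeline:

```python
import numpy as np

# Hypothetical sizes; real models use far larger dimensions.
d_model, n_heads = 8, 2
d_head = d_model // n_heads
rng = np.random.default_rng(1)

# Q, K, V projection weights and the output projection (learned; random here).
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    T = x.shape[0]
    # Project, then split the feature dimension into n_heads slices.
    q = (x @ W_q).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ W_k).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ W_v).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product scores: divide by sqrt(d_head) so the softmax stays well-behaved.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Causal mask: each position may attend only to itself and earlier positions.
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    out = softmax(scores) @ v                         # (n_heads, T, d_head)
    out = out.transpose(1, 0, 2).reshape(T, d_model)  # re-concatenate the head slices
    # The output projection mixes the head slices back into one vector,
    # then the residual connection adds the block input.
    return x + out @ W_o

y = self_attention(rng.normal(size=(4, d_model)))
print(y.shape)  # (4, 8)
```

The output projection is needed precisely because the concatenated result is a stack of per-head slices, not a single coherent vector, before the residual add.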
- FFNN / multi-layer perceptron (weights are sometimes shared to reduce the parameter count)
- layer normalization with a learned affine (scale and shift) linear transformation
- NN (usually up projection → activation → down projection)
- Activation Function like GELU
- MLP down-projection back to the model dimension so the residual connection can be applied
- MLP residual
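The MLP sub-block above (layer norm with learned scale/shift, up projection, GELU, down projection, residual) can be sketched like this; all sizes and weights are illustrative assumptions, with the common 4x hidden-width convention:

```python
import numpy as np

# Hypothetical sizes: the hidden layer is typically ~4x the model dimension.
d_model, d_hidden = 8, 32
rng = np.random.default_rng(2)
W_up, b_up = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W_down, b_down = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)  # learned LN scale and shift

def layer_norm(x, eps=1e-5):
    """Normalize each vector, then apply the learned affine transform."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x):
    h = layer_norm(x)          # pre-norm with learned affine transform
    h = gelu(h @ W_up + b_up)  # up projection -> activation
    h = h @ W_down + b_down    # down projection back to d_model
    return x + h               # MLP residual

y = mlp_block(rng.normal(size=(4, d_model)))
print(y.shape)  # (4, 8)
```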
- final layer normalization with a learned affine (scale and shift) linear transformation
- make logits from the LM head matrix via a linear projection
- softmax function to determine the probabilities
- Its goal is to take a vector and normalize its values so that they sum to 1.0.
- sampling from the distribution (or a search-based decoding strategy such as beam search)
- A higher temperature will make the distribution more uniform, and a lower temperature will make it more concentrated on the highest probability tokens
- We do this by dividing the logits (the output of the linear transformation) by the temperature before applying the softmax.
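Dividing the logits by the temperature before the softmax, then sampling, can be sketched as follows (the logit values are made up for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()  # values are non-negative and sum to 1.0

logits = np.array([2.0, 1.0, 0.1])
p_cold = softmax_with_temperature(logits, temperature=0.5)  # more concentrated on the top token
p_hot = softmax_with_temperature(logits, temperature=2.0)   # closer to uniform

# Sample the next token id from the resulting distribution.
rng = np.random.default_rng(0)
next_id = rng.choice(len(logits), p=softmax_with_temperature(logits))
```

A temperature below 1 sharpens the distribution toward the highest-probability token; a temperature above 1 flattens it.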
Transformer Block
The earlier layers tend to focus on learning lower-level features and patterns, while the later layers learn to recognize and understand higher-level abstractions and relationships.
In the context of natural language processing, the lower layers might learn grammar, syntax, and simple word associations, while the higher layers might capture more complex semantic relationships, discourse structures, and context-dependent meaning.