Transformer MLP

Creator
Seonglae Cho
Created
2024 Apr 18 16:36
Edited
2025 Apr 14 21:46
Refs
AI Memory

Transformer Feed-Forward Network

Two layers are sufficient because two is the minimum needed to introduce non-linearity. The MLP hidden dimension is typically set to 4× the residual dimension, an empirically proven efficiency trade-off.
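A minimal sketch of this shape (GPT-style block; the GELU activation and dimensions here are illustrative, not from any specific model):

```python
import torch
import torch.nn as nn

class TransformerMLP(nn.Module):
    """Two-layer feed-forward block: up-project, non-linearity, down-project."""
    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        d_hidden = expansion * d_model           # conventionally 4x the residual width
        self.w_in = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()                     # the non-linearity between the two layers
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(self.act(self.w_in(x)))

mlp = TransformerMLP(d_model=768)
x = torch.randn(2, 10, 768)                      # (batch, sequence, residual dim)
print(mlp(x).shape)                              # torch.Size([2, 10, 768])
```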
https://arxiv.org/pdf/2203.14680.pdf
  • An FFN subvalue can help increase the probabilities of the tokens with the largest logits.
  • It can reduce the probabilities of the tokens with the smallest logits.
  • It can distinguish two tokens with different logits.
There are tens of thousands of token pairs in an FFN subvalue, so one FFN subvalue can distinguish many tokens. Finally, an FFN subvalue can act as a “query” that activates other FFN subvalues.
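One way to see what a subvalue promotes or suppresses is to project it into vocabulary space through the unembedding matrix. A sketch under stated assumptions: the random `W_out`, `W_U`, and the unit index below are placeholders for weights taken from a real checkpoint:

```python
import torch

# Hypothetical shapes; in practice these come from a trained model.
d_model, d_hidden, vocab = 768, 3072, 50257
W_out = torch.randn(d_model, d_hidden)   # down-projection of one MLP layer
W_U = torch.randn(vocab, d_model)        # unembedding matrix

i = 123                                  # index of one FFN subvalue (hidden unit)
subvalue = W_out[:, i]                   # value vector written to the residual stream
logits = W_U @ subvalue                  # project into vocabulary space

promoted = torch.topk(logits, 5).indices     # tokens whose logits it raises most
suppressed = torch.topk(-logits, 5).indices  # tokens whose logits it lowers most
print(promoted.tolist(), suppressed.tolist())
```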
 
 

Transformer Feed-Forward Layers Are Key-Value Memories
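https://arxiv.org/pdf/2012.14913.pdf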

An account of factual knowledge in GPT that predates mono-semanticity work with sparse autoencoders (and has some limitations).
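In this key-value reading, each row of the first MLP matrix acts as a key and the paired column of the second matrix as a value; the layer output is a coefficient-weighted sum over these memories. A toy sketch with random weights, assuming ReLU-style coefficients for simplicity:

```python
import torch

d_model, d_hidden = 64, 256
W_in = torch.randn(d_hidden, d_model)    # each row is a "key"
W_out = torch.randn(d_model, d_hidden)   # each column is the paired "value"

x = torch.randn(d_model)                 # residual-stream input at one position
coeffs = torch.relu(W_in @ x)            # memory coefficients: how well x matches each key
y = W_out @ coeffs                       # output = coefficient-weighted sum of values

# Equivalent explicit sum over memories:
y_check = sum(coeffs[i] * W_out[:, i] for i in range(d_hidden))
assert torch.allclose(y, y_check, atol=1e-3)
```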

 
 
 
