Transformer Feed-Forward Layer

Creator
Seonglae Cho
Created
2024 Apr 18 16:36
Edited
2024 Apr 19 14:34
Refs
https://arxiv.org/pdf/2203.14680.pdf
  • An FFN subvalue can increase the probabilities of the tokens with the largest logits.
  • It can reduce the probabilities of the tokens with the smallest logits.
  • It can distinguish between two tokens with different logits.
An FFN subvalue contains tens of thousands of token pairs, so a single subvalue can distinguish many tokens. Finally, an FFN subvalue can act as a “query” that activates other FFN subvalues.
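The effect on token probabilities can be illustrated with a toy sketch. It assumes that a subvalue, once projected into vocabulary space, adds its logits (scaled by a coefficient) to the logits already in the residual stream. The four-token vocabulary, all numbers, and the names `softmax` and `add_subvalue` are hypothetical and chosen only for illustration; this is not the referenced paper's implementation.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def add_subvalue(residual_logits, subvalue_logits, coefficient):
    """Add one subvalue's logits, scaled by its coefficient (assumed additive)."""
    return [r + coefficient * s for r, s in zip(residual_logits, subvalue_logits)]


# Hypothetical toy vocabulary and logits, for illustration only.
vocab = ["cat", "dog", "car", "the"]
residual_logits = [1.0, 0.5, 0.2, 0.8]   # logits before the subvalue is added
subvalue_logits = [3.0, 0.0, -2.0, 0.5]  # largest for "cat", smallest for "car"
coefficient = 1.0

before = softmax(residual_logits)
after = softmax(add_subvalue(residual_logits, subvalue_logits, coefficient))

for tok, b, a in zip(vocab, before, after):
    print(f"{tok}: {b:.3f} -> {a:.3f}")
```

Under this additive assumption, the token with the largest subvalue logit ("cat") gains probability and the token with the smallest ("car") loses it. The log-odds between any two tokens shift by the coefficient times the difference of their subvalue logits, which is one way to read "distinguishing" a token pair.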
 
 

Transformer Feed-Forward Layers Are Key-Value Memories

Factual GPT before monosemanticity sparse autoencoder (some limitations)

 
 
 
