Transformer MLP

Creator
Seonglae Cho
Created
2024 Apr 18 16:36
Edited
2025 Jul 22 18:20
Refs
AI Memory

Transformer Feed-Forward Network

Two layers are used because that is the minimum needed to introduce non-linearity: without the activation between them, the two projections would collapse into a single linear map. The MLP hidden dimension is typically set to 4× the residual (model) dimension, a ratio that has empirically proven to be a good compute/quality trade-off.
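As a concrete reference, a minimal PyTorch sketch of this two-layer block (the class name, GELU activation, and dimensions are illustrative assumptions, not a specific model's implementation):

```python
import torch
import torch.nn as nn

class TransformerMLP(nn.Module):
    """Two-layer FFN: expand to 4x the residual width, apply a non-linearity,
    then project back down to the residual dimension."""
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)     # W_in: d_model -> 4*d_model
        self.act = nn.GELU()                                   # non-linearity between the two layers
        self.down = nn.Linear(expansion * d_model, d_model)    # W_out: 4*d_model -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))

# Usage: x has shape (batch, seq_len, d_model)
mlp = TransformerMLP(d_model=768)
x = torch.randn(2, 16, 768)
print(mlp(x).shape)  # torch.Size([2, 16, 768])
```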
https://arxiv.org/pdf/2203.14680.pdf
  • An FFN subvalue can increase the probabilities of the tokens with the largest logits.
  • It can reduce the probabilities of the tokens with the smallest logits.
  • It can distinguish between two tokens with different logits.
One FFN subvalue covers tens of thousands of token pairs, so a single subvalue can distinguish many tokens. Lastly, an FFN subvalue can act as a “query” that activates other FFN subvalues.
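A sketch of how one might inspect a single FFN subvalue in vocabulary space, in the spirit of the paper above: take one row of the MLP down-projection and push it through the unembedding to see which token logits it promotes or suppresses. The GPT-2 checkpoint, the layer/neuron indices, and the parameter paths are assumptions specific to the HuggingFace implementation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer, neuron = 10, 42                                           # arbitrary subvalue to inspect
# GPT-2 stores the MLP down-projection as Conv1D with weight shape (d_ff, d_model),
# so row `neuron` is the value vector (subvalue) written to the residual stream.
subvalue = model.transformer.h[layer].mlp.c_proj.weight[neuron]  # (d_model,)

# Project the subvalue through the unembedding to get its per-token logit contribution.
logit_contrib = model.lm_head.weight @ subvalue                  # (vocab_size,)

top = torch.topk(logit_contrib, k=10)
bottom = torch.topk(-logit_contrib, k=10)
print("promoted:", tokenizer.convert_ids_to_tokens(top.indices.tolist()))
print("suppressed:", tokenizer.convert_ids_to_tokens(bottom.indices.tolist()))
```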
 
 

Transformer Feed-Forward Layers Are Key-Value Memories

A factual-knowledge view of GPT that predates monosemanticity sparse autoencoders (with some limitations).

Comparing GRPO (RL) and SFT on the same math data: GRPO shows small improvements and small regressions, while SFT shows large improvements and large regressions. Both methods change the Q/K attention weights the most, but SFT causes much larger changes overall and heavily modifies the middle-layer MLPs. Hypothesis: middle-layer MLPs serve as knowledge repositories → freezing experiment: helps some benchmarks (e.g., GPQA) but hurts others → conclusion: inconclusive.
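A minimal sketch of the freezing step in that experiment, assuming a HuggingFace-style decoder whose MLP parameters contain ".mlp." in their names and whose layer index appears as ".<idx>." in the parameter path; the GPT-2 stand-in checkpoint, the name patterns, and the "middle third" layer range are all assumptions that vary by model.

```python
import re
from transformers import AutoModelForCausalLM

# Assumptions: GPT-2 as a stand-in checkpoint; MLP parameters contain ".mlp."
# and the layer index appears as ".<idx>." in the parameter name. Adjust the
# patterns and the layer range for the model actually being fine-tuned.
model = AutoModelForCausalLM.from_pretrained("gpt2")

n_layers = model.config.num_hidden_layers
middle = range(n_layers // 3, 2 * n_layers // 3)    # rough "middle third" of layers

frozen = 0
for name, param in model.named_parameters():
    match = re.search(r"\.(\d+)\.", name)
    if match and int(match.group(1)) in middle and ".mlp." in name:
        param.requires_grad = False                 # exclude middle-layer MLPs from SFT/GRPO updates
        frozen += 1
print(f"Froze {frozen} middle-layer MLP parameter tensors")
```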
 
 

Recommendations