Transformer Feed-Forward Network
Two linear layers with a non-linearity between them are sufficient: this is the minimum structure needed for the block to compute anything beyond a single affine map. The MLP hidden dimension is typically set to 4 times the residual dimension, a ratio chosen empirically as a good trade-off between capacity and compute.
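A minimal sketch of this block in PyTorch; the GELU activation, `d_model=768`, and the 4x expansion factor are illustrative defaults, not values taken from any specific model:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Minimal Transformer FFN: two linear layers with a non-linearity in between."""
    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        self.w_in = nn.Linear(d_model, expansion * d_model)   # first layer ("keys")
        self.act = nn.GELU()                                   # non-linearity between the two layers
        self.w_out = nn.Linear(expansion * d_model, d_model)   # second layer ("values")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> same shape, added back to the residual stream
        return self.w_out(self.act(self.w_in(x)))
```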

- An FFN subvalue can increase the probabilities of the tokens with the largest logits.
- It can reduce the probabilities of the tokens with the smallest logits.
- It can distinguish between two tokens that receive different logits.
A single FFN subvalue assigns different logit shifts to tens of thousands of token pairs, so one subvalue can distinguish many tokens. Finally, an FFN subvalue can also act as a "query" that activates other FFN subvalues (the sketch after the reference below walks through the logit-shift view).
Transformer Feed-Forward Layers Are Key-Value Memories
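A toy numerical sketch of that logit-shift view: one subvalue (a column of the second FFN matrix, scaled by its neuron activation) is added to the residual stream, so its projection through the unembedding raises some token logits and lowers others. All shapes and tensors below are made up purely for illustration:

```python
import torch

torch.manual_seed(0)
d_model, vocab_size = 64, 1000

W_U = torch.randn(d_model, vocab_size)   # unembedding matrix
residual = torch.randn(d_model)          # residual stream at one position
value_vec = torch.randn(d_model)         # one column of the FFN output matrix (a "subvalue")
activation = 2.0                         # scalar neuron activation (how strongly the "key" matched)

logits_before = residual @ W_U
logits_after = (residual + activation * value_vec) @ W_U

# The shift is the same for every input: activation * (value_vec @ W_U).
delta = logits_after - logits_before
promoted = delta.topk(5).indices         # tokens whose logits this subvalue pushes up
suppressed = (-delta).topk(5).indices    # tokens whose logits it pushes down
print(promoted.tolist(), suppressed.tolist())
```

Any two tokens whose columns of `W_U` project differently onto `value_vec` receive different shifts, which is how one subvalue can distinguish many token pairs at once.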
This factual-knowledge view of GPT predates the monosemanticity / sparse autoencoder line of work and has some limitations.
Comparing GRPO (RL) and SFT on the same math data:
- GRPO shows small improvements and small deteriorations; SFT shows large improvements and large deteriorations.
- Both methods change the Q/K weights the most.
- SFT causes much larger changes overall and significantly modifies the middle-layer MLPs.
- Hypothesis: the middle MLPs serve as knowledge repositories → freezing experiment (sketch below): helps some benchmarks (e.g., GPQA) but hurts others → results are inconclusive.
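A hedged sketch of how such a freezing experiment could be set up with Hugging Face transformers; the model name and the choice of "middle third" as the frozen range are assumptions for illustration, not the setup from these notes:

```python
from transformers import AutoModelForCausalLM

# Hypothetical model choice; any Llama/Qwen-style decoder with a .model.layers[i].mlp
# submodule works the same way.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

num_layers = model.config.num_hidden_layers
middle = range(num_layers // 3, 2 * num_layers // 3)   # "middle" layers, defined here as the middle third

for i, layer in enumerate(model.model.layers):
    if i in middle:
        for p in layer.mlp.parameters():
            p.requires_grad = False   # exclude these MLPs from fine-tuning updates

# The partially frozen model can then be passed to the usual SFT training loop.
```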