Transformer Feed-Forward Network
Two linear layers with a non-linearity between them are sufficient: this is the minimum structure needed for the block to compute anything beyond a single affine map. The MLP hidden dimension is typically set to 4 times the residual dimension, a ratio chosen empirically as a good trade-off between capacity and compute.
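A minimal sketch of this block in PyTorch; the GELU activation, `d_model=768`, and the 4x expansion factor are illustrative defaults, not values taken from any specific model:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Minimal Transformer FFN: two linear layers with a non-linearity in between."""
    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        self.w_in = nn.Linear(d_model, expansion * d_model)   # first layer ("keys")
        self.act = nn.GELU()                                   # non-linearity between the two layers
        self.w_out = nn.Linear(expansion * d_model, d_model)   # second layer ("values")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> same shape, added back to the residual stream
        return self.w_out(self.act(self.w_in(x)))
```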

- An FFN subvalue can increase the probabilities of the tokens with the largest logits.
- It can reduce the probabilities of the tokens with the smallest logits.
- It can distinguish between two tokens that receive different logits.
A single FFN subvalue assigns different logit shifts to tens of thousands of token pairs, so one subvalue can distinguish many tokens. Finally, an FFN subvalue can also act as a "query" that activates other FFN subvalues (the sketch after the reference below walks through the logit-shift view).
Transformer Feed-Forward Layers Are Key-Value Memories
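A toy numerical sketch of that logit-shift view: one subvalue (a column of the second FFN matrix, scaled by its neuron activation) is added to the residual stream, so its projection through the unembedding raises some token logits and lowers others. All shapes and tensors below are made up purely for illustration:

```python
import torch

torch.manual_seed(0)
d_model, vocab_size = 64, 1000

W_U = torch.randn(d_model, vocab_size)   # unembedding matrix
residual = torch.randn(d_model)          # residual stream at one position
value_vec = torch.randn(d_model)         # one column of the FFN output matrix (a "subvalue")
activation = 2.0                         # scalar neuron activation (how strongly the "key" matched)

logits_before = residual @ W_U
logits_after = (residual + activation * value_vec) @ W_U

# The shift is the same for every input: activation * (value_vec @ W_U).
delta = logits_after - logits_before
promoted = delta.topk(5).indices         # tokens whose logits this subvalue pushes up
suppressed = (-delta).topk(5).indices    # tokens whose logits it pushes down
print(promoted.tolist(), suppressed.tolist())
```

Any two tokens whose columns of `W_U` project differently onto `value_vec` receive different shifts, which is how one subvalue can distinguish many token pairs at once.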
This factual-knowledge view of GPT predates the monosemanticity / sparse autoencoder line of work and has some limitations.
Comparing GRPO (RL) and SFT on the same math data:
- GRPO shows small improvements and small deteriorations; SFT shows large improvements and large deteriorations.
- Both methods change the Q/K weights the most.
- SFT causes much larger changes overall and significantly modifies the middle-layer MLPs.
- Hypothesis: the middle MLPs serve as knowledge repositories → freezing experiment (sketch below): helps some benchmarks (e.g., GPQA) but hurts others → results are inconclusive.
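A hedged sketch of how such a freezing experiment could be set up with Hugging Face transformers; the model name and the choice of "middle third" as the frozen range are assumptions for illustration, not the setup from these notes:

```python
from transformers import AutoModelForCausalLM

# Hypothetical model choice; any Llama/Qwen-style decoder with a .model.layers[i].mlp
# submodule works the same way.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

num_layers = model.config.num_hidden_layers
middle = range(num_layers // 3, 2 * num_layers // 3)   # "middle" layers, defined here as the middle third

for i, layer in enumerate(model.model.layers):
    if i in middle:
        for p in layer.mlp.parameters():
            p.requires_grad = False   # exclude these MLPs from fine-tuning updates

# The partially frozen model can then be passed to the usual SFT training loop.
```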