Transformer Feed Forward Network
Two layers are sufficient because that is the minimum needed to introduce non-linearity: a linear projection up, a non-linear activation, and a linear projection back down. The MLP hidden dimension is typically set to 4 times the residual dimension, a ratio found empirically to give a good trade-off between model quality and compute cost.
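As a minimal sketch of this structure (assuming a PyTorch implementation; the class name `FeedForward`, the GELU activation, and `d_model = 768` are illustrative choices, not taken from the text above):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Two-layer transformer MLP: expand to 4*d_model, apply a non-linearity, project back."""
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        self.w_in = nn.Linear(d_model, expansion * d_model)    # first layer: d_model -> 4*d_model
        self.act = nn.GELU()                                   # the non-linearity between the two layers
        self.w_out = nn.Linear(expansion * d_model, d_model)   # second layer: 4*d_model -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); output has the same shape and is added to the residual stream
        return self.w_out(self.act(self.w_in(x)))

# usage
ffn = FeedForward(d_model=768)      # e.g. GPT-2 small's residual width
x = torch.randn(2, 16, 768)
y = ffn(x)                          # shape (2, 16, 768)
```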

- An FFN subvalue can help increase the probabilities of tokens with the largest logits.
- It can reduce the probabilities of tokens with the smallest logits.
- It can distinguish between two tokens with different logits.
Because an FFN subvalue assigns logits to tens of thousands of tokens, a single subvalue can distinguish many token pairs, as sketched below. Finally, an FFN subvalue can act as a "query" that activates other FFN subvalues.
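To make the subvalue view concrete, here is a hedged sketch (assuming GPT-2-style weights loaded via Hugging Face `transformers`; the choice of model, `layer_idx`, and `neuron_idx` are arbitrary illustrations). Each row of the second FFN matrix is a value vector written to the residual stream; projecting it through the unembedding matrix shows which token logits it pushes up or down:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

layer_idx, neuron_idx = 10, 123   # which FFN subvalue to inspect (arbitrary choice)

# GPT-2's second FFN matrix (c_proj) has shape (4*d_model, d_model);
# row `neuron_idx` is the value vector added to the residual stream
# whenever that neuron fires.
subvalue = model.transformer.h[layer_idx].mlp.c_proj.weight[neuron_idx]   # (d_model,)

# Project the subvalue onto the vocabulary with the unembedding matrix
# to see which token logits it increases or decreases.
logit_contrib = model.lm_head.weight @ subvalue                           # (vocab_size,)

top = torch.topk(logit_contrib, 5)
bottom = torch.topk(-logit_contrib, 5)
print("boosted:   ", [tok.decode([int(i)]) for i in top.indices])
print("suppressed:", [tok.decode([int(i)]) for i in bottom.indices])
```

The sign and magnitude of `logit_contrib` for any two tokens show how that single subvalue separates them, which is one way to read the "distinguish two tokens with different logits" point above.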