Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently.
Removing the FFN from the decoder layers, sharing a single FFN across the encoder layers, and then widening that shared FFN achieves substantial gains in both accuracy and latency with respect to the original Transformer Big.
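A minimal sketch (not the paper's code) of this "one wide FFN" layout: decoder layers keep only self- and cross-attention, while every encoder layer reuses a single widened FFN module. Class names, dimensions, and the `OneWideFFNTransformer` wrapper are illustrative assumptions; masks and embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn


class SharedFFN(nn.Module):
    """One wide feed-forward block reused by every encoder layer."""
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class EncoderLayer(nn.Module):
    """Self-attention plus a residual call into the *shared* FFN."""
    def __init__(self, d_model: int, n_heads: int, shared_ffn: SharedFFN):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = shared_ffn  # the same module object in every encoder layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + a)
        return self.norm2(x + self.ffn(x))


class FFNFreeDecoderLayer(nn.Module):
    """Decoder layer with self- and cross-attention but no FFN at all."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        a, _ = self.self_attn(y, y, y, need_weights=False)  # causal mask omitted
        y = self.norm1(y + a)
        c, _ = self.cross_attn(y, memory, memory, need_weights=False)
        return self.norm2(y + c)


class OneWideFFNTransformer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6, d_ff_wide=8192):
        super().__init__()
        shared_ffn = SharedFFN(d_model, d_ff_wide)  # widened to recover capacity
        self.encoder = nn.ModuleList(
            EncoderLayer(d_model, n_heads, shared_ffn) for _ in range(n_layers)
        )
        self.decoder = nn.ModuleList(
            FFNFreeDecoderLayer(d_model, n_heads) for _ in range(n_layers)
        )

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        for layer in self.encoder:
            src = layer(src)
        for layer in self.decoder:
            tgt = layer(tgt, src)
        return tgt


# Usage: parameter count drops versus per-layer FFNs, since only one (wide) FFN exists.
model = OneWideFFNTransformer()
src, tgt = torch.randn(2, 10, 512), torch.randn(2, 7, 512)
print(model(src, tgt).shape)  # torch.Size([2, 7, 512])
```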
One Wide Feedforward is All You Need
The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position,...
https://arxiv.org/abs/2309.01826


Seonglae Cho