One Wide Feedforward is All You Need

Creator
Seonglae Cho
Created
2023 Sep 16 17:14
Edited
2023 Sep 16 17:15
Refs
The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently.
Removing the FFN from the decoder layers, sharing a single FFN across the encoder layers, and widening that shared FFN back up to the original parameter count achieves substantial gains in both accuracy and latency with respect to the original Transformer Big.
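A minimal PyTorch sketch of the idea, not the paper's implementation: class names and hyperparameters such as `d_ff_wide` are illustrative assumptions. Every encoder layer reuses one widened FFN, while decoder layers keep self- and cross-attention but drop the FFN sublayer entirely.

```python
import torch
import torch.nn as nn

class SharedFFNEncoder(nn.Module):
    """Encoder whose layers all reuse a single wide FFN (illustrative sketch)."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6, d_ff_wide=8192, dropout=0.1):
        super().__init__()
        # One FFN instance shared by every layer, widened to recoup the
        # parameters freed by dropping the per-layer FFNs.
        self.shared_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff_wide),
            nn.ReLU(),
            nn.Linear(d_ff_wide, d_model),
        )
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
             for _ in range(n_layers)]
        )
        self.norms1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        for attn, n1, n2 in zip(self.attns, self.norms1, self.norms2):
            h, _ = attn(x, x, x, need_weights=False)
            x = n1(x + self.drop(h))
            # The same FFN weights are applied at every layer.
            x = n2(x + self.drop(self.shared_ffn(x)))
        return x

class AttentionOnlyDecoderLayer(nn.Module):
    """Decoder layer keeping self- and cross-attention but no FFN sublayer."""
    def __init__(self, d_model=512, n_heads=8, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.n1, self.n2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, memory, tgt_mask=None):
        h, _ = self.self_attn(x, x, x, attn_mask=tgt_mask, need_weights=False)
        x = self.n1(x + self.drop(h))
        h, _ = self.cross_attn(x, memory, memory, need_weights=False)
        return self.n2(x + self.drop(h))  # FFN removed entirely

# Quick shape check
enc = SharedFFNEncoder()
memory = enc(torch.randn(2, 10, 512))
dec_layer = AttentionOnlyDecoderLayer()
out = dec_layer(torch.randn(2, 7, 512), memory)
print(out.shape)  # torch.Size([2, 7, 512])
```

Removing the decoder FFNs in particular helps latency: decoding is autoregressive, so decoder-side compute is paid once per generated token, while the encoder's shared wide FFN runs only once per input.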