Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently.
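To make the contrast concrete, a brief restatement in formulas: attention mixes information across positions through learned weights, whereas the standard position-wise FFN of the original Transformer applies the same two-layer non-linear map to each token vector $x_i$ separately (the notation below is the standard one, not specific to this work):

```latex
% Position-wise FFN: identical weights are applied at every position,
% so no information flows between tokens inside this sublayer.
\mathrm{FFN}(x_i) = W_2 \,\max\!\left(0,\; W_1 x_i + b_1\right) + b_2,
\qquad i = 1, \dots, n
```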
By removing the FFN from the decoder layers and sharing a single FFN across the encoder, we achieve substantial gains in both accuracy and latency with respect to the original Transformer Big.
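A minimal PyTorch sketch of this idea, assuming a pre-built shared FFN module and standard post-norm sublayers; module names, dimensions, and the omission of attention masks are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SharedFFNEncoderLayer(nn.Module):
    """Encoder layer whose FFN sublayer reuses one module shared by all layers."""
    def __init__(self, d_model, n_heads, shared_ffn):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = shared_ffn                     # same module object in every layer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)              # attention sublayer + residual
        return self.norm2(x + self.ffn(x))        # shared FFN sublayer + residual

class FFNFreeDecoderLayer(nn.Module):
    """Decoder layer with the FFN removed: only self- and cross-attention remain."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, y, memory):
        sa, _ = self.self_attn(y, y, y)           # causal mask omitted for brevity
        y = self.norm1(y + sa)
        ca, _ = self.cross_attn(y, memory, memory)
        return self.norm2(y + ca)

# Illustrative sizes; one FFN instance is shared by reference across encoder layers.
d_model, d_ff, n_heads, n_layers = 512, 2048, 8, 6
shared_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
encoder = nn.ModuleList([SharedFFNEncoderLayer(d_model, n_heads, shared_ffn) for _ in range(n_layers)])
decoder = nn.ModuleList([FFNFreeDecoderLayer(d_model, n_heads) for _ in range(n_layers)])

src, tgt = torch.randn(2, 10, d_model), torch.randn(2, 7, d_model)
memory = src
for layer in encoder:
    memory = layer(memory)
out = tgt
for layer in decoder:
    out = layer(out, memory)
print(out.shape)  # torch.Size([2, 7, 512])
```

Because every encoder layer holds a reference to the same `shared_ffn`, its parameters are counted (and updated) once, and the decoder layers carry no FFN parameters at all, which is where the parameter and latency savings in this sketch would come from.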