One Wide Feedforward is All You Need

Creator: Seonglae Cho
Created: 2023 Sep 16 17:14
Edited: 2023 Sep 16 17:15
Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently.
Removing the FFN from the decoder layers and sharing a single FFN across the encoder layers yields substantial gains in both accuracy and latency compared to the original Transformer Big.
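A minimal PyTorch sketch of the idea (not the authors' implementation; layer counts, widths, and the omitted attention masks are illustrative assumptions): every encoder layer calls the same widened FFN module, while decoder layers keep only their attention blocks.

```python
import torch
import torch.nn as nn

class SharedFFN(nn.Module):
    """One wide position-wise feedforward block reused by every encoder layer."""
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class EncoderLayer(nn.Module):
    """Self-attention plus a residual call into the single shared FFN."""
    def __init__(self, d_model: int, n_heads: int, shared_ffn: SharedFFN):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.shared_ffn = shared_ffn  # the same module object in every layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.shared_ffn(x))
        return x

class DecoderLayer(nn.Module):
    """Decoder layer with the FFN removed: only self- and cross-attention remain.
    Causal masking is omitted here for brevity."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        self_out, _ = self.self_attn(y, y, y, need_weights=False)
        y = self.norm1(y + self_out)
        cross_out, _ = self.cross_attn(y, memory, memory, need_weights=False)
        y = self.norm2(y + cross_out)
        return y

# Usage: all encoder layers point at one FFN instance, widened (assumed 4x here)
# to compensate for the parameters removed elsewhere.
d_model, d_ff, n_heads, n_layers = 512, 4 * 2048, 8, 6
shared_ffn = SharedFFN(d_model, d_ff)
encoder = nn.ModuleList(EncoderLayer(d_model, n_heads, shared_ffn) for _ in range(n_layers))
decoder = nn.ModuleList(DecoderLayer(d_model, n_heads) for _ in range(n_layers))

src, tgt = torch.randn(2, 10, d_model), torch.randn(2, 7, d_model)
memory = src
for layer in encoder:
    memory = layer(memory)
out = tgt
for layer in decoder:
    out = layer(out, memory)
print(out.shape)  # torch.Size([2, 7, 512])
```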