One Wide Feedforward is All You Need

Creator
Seonglae Cho
Created
2023 Sep 16 17:14
Edited
2023 Sep 16 17:15
Refs
The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently.
Removing the FFN from the decoder layers, sharing a single FFN across the encoder layers, and widening that shared FFN back up to the original parameter count achieves substantial gains in both accuracy and latency with respect to the original Transformer Big.
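A minimal PyTorch sketch of the idea, not the paper's implementation: class names and hyperparameters such as `d_ff_wide` are illustrative assumptions. Every encoder layer reuses one widened FFN, while decoder layers keep self- and cross-attention but drop the FFN sublayer entirely.

```python
import torch
import torch.nn as nn

class SharedFFNEncoder(nn.Module):
    """Encoder whose layers all reuse a single wide FFN (illustrative sketch)."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6, d_ff_wide=8192, dropout=0.1):
        super().__init__()
        # One FFN instance shared by every layer, widened to recoup the
        # parameters freed by dropping the per-layer FFNs.
        self.shared_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff_wide),
            nn.ReLU(),
            nn.Linear(d_ff_wide, d_model),
        )
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
             for _ in range(n_layers)]
        )
        self.norms1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        for attn, n1, n2 in zip(self.attns, self.norms1, self.norms2):
            h, _ = attn(x, x, x, need_weights=False)
            x = n1(x + self.drop(h))
            # The same FFN weights are applied at every layer.
            x = n2(x + self.drop(self.shared_ffn(x)))
        return x

class AttentionOnlyDecoderLayer(nn.Module):
    """Decoder layer keeping self- and cross-attention but no FFN sublayer."""
    def __init__(self, d_model=512, n_heads=8, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.n1, self.n2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, memory, tgt_mask=None):
        h, _ = self.self_attn(x, x, x, attn_mask=tgt_mask, need_weights=False)
        x = self.n1(x + self.drop(h))
        h, _ = self.cross_attn(x, memory, memory, need_weights=False)
        return self.n2(x + self.drop(h))  # FFN removed entirely

# Quick shape check
enc = SharedFFNEncoder()
memory = enc(torch.randn(2, 10, 512))
dec_layer = AttentionOnlyDecoderLayer()
out = dec_layer(torch.randn(2, 7, 512), memory)
print(out.shape)  # torch.Size([2, 7, 512])
```

Removing the decoder FFNs in particular helps latency: decoding is autoregressive, so decoder-side compute is paid once per generated token, while the encoder's shared wide FFN runs only once per input.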