MEGA Model

Creator: Seonglae Cho
Created: 2024 Apr 28 8:55
Edited: 2024 Apr 28 9:27

Moving Average Equipped Gated Attention

Single-head gated attention

It is interesting to see that Mega with a full attention field is also much more efficient than the Transformer, benefiting from its single-head gated attention.
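
As a rough illustration of the idea (a hypothetical minimal sketch, not Mega's actual module, which also combines attention with EMA and GRU-style gating), single-head gated attention amounts to standard scaled dot-product attention with a single head whose output is modulated by an element-wise learned gate:

```python
import torch
import torch.nn as nn

class SingleHeadGatedAttention(nn.Module):
    """Hypothetical sketch: one attention head with an element-wise output gate."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
        attn = scores.softmax(dim=-1) @ v           # single head: no head splitting
        gate = torch.sigmoid(self.gate_proj(x))     # element-wise output gate
        return self.out_proj(gate * attn)
```

With a single head, the seq_len x seq_len attention weights are materialized once per layer rather than once per head, which reduces memory overhead relative to multi-head attention.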

Exponential moving average

EMA captures local dependencies that decay exponentially over time, with learnable per-dimension coefficients.
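
Below is a minimal sketch of a damped EMA recurrence with learnable per-dimension coefficients, in the spirit of Mega's multi-dimensional damped EMA, y_t = α ⊙ x_t + (1 − α ⊙ δ) ⊙ y_{t−1}. The sequential loop is for clarity only (in practice the recurrence is unrolled into a convolution and computed in parallel), and the function name `damped_ema` is illustrative:

```python
import torch

def damped_ema(x: torch.Tensor, alpha: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}, applied per dimension.

    x:            (seq_len, d) input sequence
    alpha, delta: (d,) coefficients in (0, 1), learnable via e.g. sigmoid of a parameter
    """
    y = torch.zeros(x.shape[-1], dtype=x.dtype)
    outputs = []
    for x_t in x:                                    # recurrence over time steps
        y = alpha * x_t + (1 - alpha * delta) * y    # older inputs decay exponentially
        outputs.append(y)
    return torch.stack(outputs)

# Example: larger alpha weights the current input more; delta damps the decay rate.
x = torch.randn(128, 16)
alpha = torch.sigmoid(torch.randn(16))
delta = torch.sigmoid(torch.randn(16))
smoothed = damped_ema(x, alpha, delta)  # (128, 16)
```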
