Moving Average Equipped Gated Attention
Single-head gated attention
Notably, Mega with a full attention field is also much more efficient than the Transformer, benefiting from its single-head gated attention.
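As a rough illustration of the idea, the sketch below implements a single attention head whose output is modulated elementwise by a learned sigmoid gate. This is a simplified stand-in, not the exact Mega formulation; the weight-matrix names (`Wq`, `Wk`, `Wv`, `Wg`) are placeholders introduced here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_gated_attention(x, Wq, Wk, Wv, Wg):
    """Single attention head with a sigmoid output gate (simplified sketch).

    x: (seq_len, d) input; Wq, Wk, Wv, Wg: (d, d) learned projections.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Scaled dot-product attention over the full attention field.
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = scores @ v
    # Elementwise sigmoid gate on the attention output.
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))
    return gate * out
```

Because a single gated head replaces multiple parallel heads, the per-layer projection cost drops accordingly, which is part of why the full-field variant remains cheaper than a standard multi-head Transformer layer.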

Exponential moving average
EMA captures local dependencies that decay exponentially over time, with learnable decay coefficients.
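A minimal sketch of the damped EMA recurrence, assuming per-dimension decay and damping vectors (`alpha`, `delta`); in Mega these are learned parameters, while here they are passed in as fixed arrays for illustration:

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Damped EMA: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}.

    x: (seq_len, d) input sequence.
    alpha, delta: (d,) per-dimension decay and damping factors in [0, 1].
    """
    y = np.zeros_like(x)
    prev = np.zeros(x.shape[-1])
    for t in range(x.shape[0]):
        # Each step mixes the new input with an exponentially
        # decayed copy of the running average.
        prev = alpha * x[t] + (1.0 - alpha * delta) * prev
        y[t] = prev
    return y
```

With `alpha` close to 1 the output tracks recent inputs closely (short effective memory); smaller values of `alpha` weight older inputs more heavily, which is how the learnable coefficients control the locality of the captured dependencies.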