Moving Average Equipped Gated Attention
Single-head gated attention
Notably, Mega with a full attention field is also much more efficient than the Transformer, benefiting from its single-head gated attention.
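As a rough illustration of the idea, the sketch below implements a single attention head whose output is modulated elementwise by a learned sigmoid gate. This is a simplified stand-in, not the exact Mega formulation; the weight-matrix names (`Wq`, `Wk`, `Wv`, `Wg`) are placeholders introduced here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_gated_attention(x, Wq, Wk, Wv, Wg):
    """Single attention head with a sigmoid output gate (simplified sketch).

    x: (seq_len, d) input; Wq, Wk, Wv, Wg: (d, d) learned projections.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Scaled dot-product attention over the full attention field.
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = scores @ v
    # Elementwise sigmoid gate on the attention output.
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))
    return gate * out
```

Because a single gated head replaces multiple parallel heads, the per-layer projection cost drops accordingly, which is part of why the full-field variant remains cheaper than a standard multi-head Transformer layer.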

Exponential moving average
EMA captures local dependencies that decay exponentially over time, with learnable decay coefficients.
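A minimal sketch of the damped EMA recurrence, assuming per-dimension decay and damping vectors (`alpha`, `delta`); in Mega these are learned parameters, while here they are passed in as fixed arrays for illustration:

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Damped EMA: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}.

    x: (seq_len, d) input sequence.
    alpha, delta: (d,) per-dimension decay and damping factors in [0, 1].
    """
    y = np.zeros_like(x)
    prev = np.zeros(x.shape[-1])
    for t in range(x.shape[0]):
        # Each step mixes the new input with an exponentially
        # decayed copy of the running average.
        prev = alpha * x[t] + (1.0 - alpha * delta) * prev
        y[t] = prev
    return y
```

With `alpha` close to 1 the output tracks recent inputs closely (short effective memory); smaller values of `alpha` weight older inputs more heavily, which is how the learnable coefficients control the locality of the captured dependencies.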