Computes two attention maps with separate query/key weights and subtracts one from the other, so the attention score common to both (noise on irrelevant context) cancels out
Differential Transformer
Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise....
https://arxiv.org/abs/2410.05258
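
As a rough illustration of the subtraction idea, here is a minimal single-head sketch in plain Python. It assumes the core form of differential attention, (softmax(Q1 K1^T / sqrt(d)) - lambda * softmax(Q2 K2^T / sqrt(d))) V, and treats lambda as a fixed scalar passed in by the caller. The function names (`attention_map`, `diff_attention`) are hypothetical, and the sketch leaves out the learnable lambda, multi-head handling, normalization, and masking, so it is not the paper's full implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d)) for lists of query and key vectors."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        for q in Q
    ]


def diff_attention(Q1, K1, Q2, K2, V, lam):
    """Differential attention sketch (lam is a fixed scalar here, an assumption).

    Returns (output, diff_map), where diff_map = A1 - lam * A2.
    """
    A1 = attention_map(Q1, K1)  # first attention map
    A2 = attention_map(Q2, K2)  # second map, meant to capture shared noise
    diff_map = [
        [a1 - lam * a2 for a1, a2 in zip(r1, r2)]
        for r1, r2 in zip(A1, A2)
    ]
    n_cols = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(n_cols)]
        for row in diff_map
    ]
    return output, diff_map
```

Because each softmax row sums to 1, every row of the differential map sums to 1 - lambda, and entries can be negative. Whatever score both maps assign to the same token is what gets subtracted away.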


Seonglae Cho