The cause was identified as excessive activation in specific hidden-state dimensions (D_sink). The resulting visual sink tokens are semantically unrelated to the image and can be removed without affecting performance. The paper therefore proposes VAR (Visual Attention Redistribution), a technique that recycles this wasted attention budget in two steps (sketched below):
- Image-centric head selection: identify the heads that attend to actual visual information
- Attention redistribution: move the attention budget from visual sink tokens to valid non-sink visual tokens within those heads
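A minimal sketch of the two steps, assuming post-softmax attention maps; `sink_dims`, `tau`, and `rho` are hypothetical placeholders, and the head-selection rule here is a simplification of the paper's actual criterion.

```python
import torch

def find_visual_sink_tokens(hidden, sink_dims, tau=20.0):
    """Flag visual tokens whose activation in the sink dimensions
    (D_sink) exceeds a threshold tau (placeholder value)."""
    # hidden: (num_visual_tokens, d_model)
    sink_act = hidden[:, sink_dims].abs().max(dim=-1).values
    return sink_act > tau  # (num_visual_tokens,) boolean mask

def image_centric_heads(attn, visual_idx, sink_mask, rho=0.2):
    """Select heads whose mean attention mass on non-sink visual tokens
    exceeds a ratio rho (a stand-in for the paper's selection rule)."""
    keep_cols = visual_idx[~sink_mask]
    mass = attn[:, :, keep_cols].sum(dim=-1).mean(dim=-1)  # (num_heads,)
    return mass > rho

def redistribute_attention(attn, visual_idx, sink_mask, head_mask):
    """Within the selected heads, remove attention mass from visual sink
    tokens and hand it to the remaining (non-sink) visual tokens,
    proportional to their existing attention."""
    # attn: (num_heads, q_len, k_len) post-softmax attention
    attn = attn.clone()
    sink_cols = visual_idx[sink_mask]   # visual sink token positions
    keep_cols = visual_idx[~sink_mask]  # non-sink visual token positions
    for h in torch.nonzero(head_mask).flatten().tolist():
        budget = attn[h][:, sink_cols].sum(dim=-1, keepdim=True)  # (q_len, 1)
        attn[h][:, sink_cols] = 0.0
        weights = attn[h][:, keep_cols]
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        # each row regains exactly the removed budget, so rows still sum to 1
        attn[h][:, keep_cols] += budget * weights
    return attn
```

With these pieces, one would flag sink tokens from the visual hidden states, pick heads with `image_centric_heads`, and pass the result to `redistribute_attention` at each layer.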
The result: improved performance on general vision-language tasks, reduced visual hallucination, and stronger vision-centric task performance.
Interpretation
The attention sink is viewed as overactivation in a non-informative subspace of the network's dimensions. The paper also experimentally observed that visual sinks share a feature basis with language sinks: the same dimensions are abnormally activated in both.
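A hypothetical probe of that shared-basis observation, not the paper's protocol: rank the most over-activated hidden dimensions for language-sink tokens (e.g., the BOS token) and for visual sink tokens, then measure how much the two sets overlap.

```python
import torch

def top_activated_dims(hidden_states, k=10):
    # hidden_states: (num_tokens, d_model); rank dims by mean |activation|
    return set(hidden_states.abs().mean(dim=0).topk(k).indices.tolist())

def shared_basis_overlap(lang_sink_h, visual_sink_h, k=10):
    """Jaccard overlap between the top-k over-activated dimensions of
    language-sink and visual-sink hidden states; an overlap near 1
    would indicate a shared D_sink."""
    a = top_activated_dims(lang_sink_h, k)
    b = top_activated_dims(visual_sink_h, k)
    return len(a & b) / len(a | b)
```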
In other words, the transformer's token routing circuit, the structural pathway that determines which tokens attend to which tokens and where information flows, transfers directly to the visual modality as well.
