Visual Attention Sink

Created: 2025 Oct 20 23:55
Creator: Seonglae Cho
Edited: 2026 Jun 26 14:57
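Large multimodal models consistently assign high attention weights to certain visual tokens even when those tokens are irrelevant to the text; the paper calls this the visual attention sink. The cause was identified as excessive activation in specific dimensions (D_sink) of the hidden state.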
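As a rough illustration of how sink tokens could be flagged from this signal, the sketch below marks a token as a sink when any of its sink dimensions is unusually large after RMS normalization. The function name `find_sink_tokens`, the threshold `tau`, the normalization choice, and the example indices in `sink_dims` are all assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def find_sink_tokens(hidden, sink_dims, tau=20.0, eps=1e-6):
    """Return indices of tokens whose sink dimensions are strongly activated.

    hidden:    (num_tokens, hidden_dim) hidden states from one layer
    sink_dims: indices of the suspected sink dimensions (D_sink), assumed known
    tau:       activation threshold (illustrative default, an assumption)
    """
    hidden = np.asarray(hidden, dtype=float)
    # RMS-normalize each token so the threshold is comparable across tokens.
    rms = np.sqrt(np.mean(hidden ** 2, axis=-1, keepdims=True) + eps)
    normed = hidden / rms
    # Largest absolute activation among the sink dimensions, per token.
    sink_value = np.abs(normed[:, sink_dims]).max(axis=-1)
    return np.flatnonzero(sink_value >= tau)

# Toy usage: 8 tokens, one of which has a massive activation in dimension 5.
rng = np.random.default_rng(0)
toy_hidden = rng.normal(size=(8, 1024))
toy_hidden[3, 5] = 500.0
toy_sinks = find_sink_tokens(toy_hidden, sink_dims=[5, 7])
```

Note that after RMS normalization a single dimension can be at most sqrt(hidden_dim) in magnitude, so the threshold has to be chosen relative to the model width.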
The sink tokens are semantically unrelated to the image content and can be removed without affecting performance. The paper therefore proposes VAR (Visual Attention Redistribution), which treats the attention spent on them as a budget to be recycled, in two steps:
  • Image-centric head selection (heads that focus on actual visual information)
  • Redistribute attention from visual sink tokens to the valid (non-sink) visual tokens
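The two steps above can be sketched on the attention weights of a single query token. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `redistribute_attention`, the head-selection rule (attention mass on non-sink visual tokens at least `rho`), the recycled fraction `p`, and the proportional reallocation are all illustrative choices.

```python
import numpy as np

def redistribute_attention(attn, visual_idx, sink_idx, p=0.5, rho=0.4):
    """Move part of the sink attention onto non-sink visual tokens.

    attn:       (num_heads, num_tokens) attention weights of one query token,
                each row summing to 1
    visual_idx: indices of all visual tokens
    sink_idx:   indices of the visual sink tokens
    p:          fraction of sink attention to recycle (assumed hyperparameter)
    rho:        head-selection threshold (assumed hyperparameter)
    """
    attn = np.asarray(attn, dtype=float)
    visual_idx = np.asarray(visual_idx, dtype=int)
    sink_idx = np.asarray(sink_idx, dtype=int)
    target_idx = np.setdiff1d(visual_idx, sink_idx)  # non-sink visual tokens

    sink_mass = attn[:, sink_idx].sum(axis=-1)
    target_mass = attn[:, target_idx].sum(axis=-1)

    # Step 1: keep only heads that already look at real visual content.
    selected = (target_mass >= rho) & (target_mass > 0)

    # Step 2: in those heads, shrink the sink attention and hand the freed
    # budget to non-sink visual tokens in proportion to their current weight.
    out = attn.copy()
    for h in np.flatnonzero(selected):
        budget = p * sink_mass[h]
        out[h, sink_idx] = (1.0 - p) * attn[h, sink_idx]
        out[h, target_idx] = attn[h, target_idx] * (1.0 + budget / target_mass[h])
    return out, selected
```

Because the budget removed from the sink tokens is exactly what is added to the other visual tokens, each row still sums to 1 and the weights on text tokens are left unchanged.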
The paper reports improved performance on general vision-language tasks, reduced visual hallucination, and better performance on vision-centric tasks.

Interpretation

The attention sink is viewed as overactivation of a subspace spanned by non-informative dimensions of the network. The paper also observed experimentally that visual sinks share a feature basis with language sinks: the same dimensions are abnormally activated.
The token routing circuit within the transformer, i.e. the structural pathway that determines which tokens attend to which and where information flows, carries over directly to the visual modality.
 
 
arxiv.org
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Ideally, these models should focus on key visual...
Object Hallucination because of the high norm
DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object...
Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in...
 
 
