Multimodal Attention Head

Creator: Seonglae Cho
Created: 2025 Oct 22 22:25
Edited: 2025 Oct 22 22:36
Image→text information flow occurs through the distributed cooperation of many heads, so key heads cannot be identified by single-head ablation. Attention weight and head importance are only weakly correlated, so heads cannot be interpreted from their weights alone. Objects in the same semantic category use similar head combinations. Image information is mainly transmitted through role tokens such as 'ASSISTANT' and ':'; among image tokens, only some object tokens and a few background tokens actually contribute to the transfer.
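Why single-head ablation fails under distributed cooperation can be shown with a toy redundancy model. This is a hypothetical sketch, not the paper's setup: three stand-in heads each carry the image signal, and the readout routes redundantly over them, so zeroing any one head changes nothing while ablating the group collapses the output.

```python
import numpy as np

def model_output(head_mask):
    """Toy stand-in for an LVLM readout (illustrative assumption):
    three redundant heads each contribute the image signal, and the
    readout takes the max over heads, i.e. redundant routing."""
    signal = np.array([1.0, 1.0, 1.0]) * head_mask  # per-head contribution
    return float(signal.max())

full = model_output(np.ones(3))
# single-head ablation: zero out one head at a time
single = [model_output(np.ones(3) - np.eye(3)[i]) for i in range(3)]
# group ablation: zero out all three cooperating heads
group = model_output(np.zeros(3))
```

Here every single-head ablation leaves the output unchanged, so no individual head looks "key", yet the group as a whole is necessary.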

Localization head

Localization heads are automatically selected by two criteria: the sum of attention weights on image tokens and the spatial entropy of the attention map. By simply combining these heads' attention maps and converting them into masks/boxes, Visual Grounding can be predicted accurately without any training. This suggests LVLMs inherently understand language-image relationships, and that some attention heads implicitly encode visual location information.
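The selection and map→box steps above can be sketched in a few lines. This is a minimal illustration with assumed details: the scoring rule (attention sum minus spatial entropy) and the threshold fraction are my simplifications, not the exact criteria from the source.

```python
import numpy as np

def spatial_entropy(attn_map):
    """Shannon entropy of a normalized 2-D attention map.
    Low entropy = attention concentrated on few patches,
    a proxy for spatial localization."""
    p = attn_map.flatten()
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def select_localization_heads(head_maps, top_k=3):
    """Rank heads: high attention sum and low spatial entropy.
    Combining the two into one score is an illustrative choice."""
    scores = [(m.sum() - spatial_entropy(m), h)
              for h, m in enumerate(head_maps)]
    scores.sort(reverse=True)
    return [h for _, h in scores[:top_k]]

def attention_to_box(attn_map, thresh=0.5):
    """Threshold the attention map into a mask and return its
    bounding box as (row0, col0, row1, col1)."""
    mask = attn_map >= thresh * attn_map.max()
    rows, cols = np.where(mask)
    return rows.min(), cols.min(), rows.max(), cols.max()
```

For example, a head whose attention forms a tight blob scores above diffuse heads and its map thresholds directly into a box around the attended region, with no training involved.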
 
