Image→text information flow occurs through the distributed cooperation of multiple heads, so key heads cannot be identified by ablating single heads. Attention weight and head importance have low correlation → heads cannot be interpreted from attention weights alone.
Objects of the same semantic category use similar head combinations. Image information is mainly transmitted through role tokens such as 'ASSISTANT' and ':'. Among image tokens, only some object tokens and a few background tokens actually contribute to this transmission.
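The role-token bottleneck can be quantified by summing the attention mass a generated token places on each token group. A minimal sketch with a toy, made-up attention row (the token layout and all values are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical token layout for one generated token's attention row:
# indices 0..575 = image tokens, 576..579 = role tokens ("ASSISTANT", ":", ...),
# remaining 20 = other text tokens. All values are synthetic.
n_img, n_role, n_txt = 576, 4, 20
rng = np.random.default_rng(0)
row = rng.uniform(0, 1e-4, n_img + n_role + n_txt)
row[n_img:n_img + n_role] = 0.2   # concentrate mass on role tokens
row /= row.sum()                  # attention rows sum to 1

# Aggregate attention mass per token group
groups = {
    "image": row[:n_img].sum(),
    "role":  row[n_img:n_img + n_role].sum(),
    "text":  row[n_img + n_role:].sum(),
}
dominant = max(groups, key=groups.get)
print(dominant)  # role tokens carry most of the mass in this toy setup
```

In a real measurement, `row` would come from the model's attention tensors (e.g. with attention outputs enabled during generation), aggregated over layers and heads.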
Localization head
Localization heads are automatically selected by two criteria: the attention sum over image tokens and the spatial entropy of the attention map. Simply combining the attention maps of these heads and converting them into masks/boxes predicts visual grounding accurately without any training. This implies that LVLMs inherently understand language-image relationships, and that some attention heads implicitly encode visual location information.
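The selection-and-grounding step can be sketched as follows. This is a simplified numpy sketch with synthetic attention maps: the paper's spatial entropy is computed over connected components of the binarized map, whereas here plain Shannon entropy of the normalized map is used as a proxy, and all sizes, thresholds, and function names are assumptions.

```python
import numpy as np

def head_scores(attn_maps, eps=1e-8):
    """Score per-head image-attention maps of shape (H, G, G) by
    (1) total attention mass on image tokens and (2) spatial entropy.
    Proxy: Shannon entropy of the normalized map (low = concentrated)."""
    flat = attn_maps.reshape(len(attn_maps), -1)
    sums = flat.sum(axis=1)
    p = flat / (flat.sum(axis=1, keepdims=True) + eps)
    ent = -(p * np.log(p + eps)).sum(axis=1)
    return sums, ent

def select_localization_heads(attn_maps, k=3):
    """Keep heads with high attention sum, then pick the k with the
    lowest (most spatially concentrated) entropy among them."""
    sums, ent = head_scores(attn_maps)
    top = np.argsort(-sums)[: max(k * 3, k)]
    return top[np.argsort(ent[top])][:k]

def attention_to_box(attn_map, thresh=0.5):
    """Binarize a combined attention map and return the bounding box
    (row0, col0, row1, col1) of cells above thresh * max."""
    mask = attn_map >= thresh * attn_map.max()
    rows, cols = np.where(mask)
    return rows.min(), cols.min(), rows.max() + 1, cols.max() + 1

# Synthetic demo: 8 heads over a 16x16 image-token grid; head 2 attends
# sharply to a 3x3 object region, the rest are diffuse noise.
rng = np.random.default_rng(0)
attn = rng.uniform(0, 0.01, size=(8, 16, 16))
attn[2, 5:8, 5:8] = 1.0
heads = select_localization_heads(attn, k=1)
combined = attn[heads].mean(axis=0)
box = attention_to_box(combined, thresh=0.5)
```

With real model attentions, `attn_maps` would be the rows from generated text tokens to image-token positions, reshaped to the vision grid; the training-free aspect is that nothing here is learned.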

Seonglae Cho