Image→text information flow occurs through the distributed cooperation of multiple heads, so key heads cannot be identified by ablating single heads. Attention weight and head importance have low correlation → heads cannot be interpreted from attention weights alone.
Objects of the same semantic category use similar head combinations. Image information is mainly transmitted through role tokens such as 'ASSISTANT' and ':'. Among image tokens, only some object tokens and a few background tokens actually contribute to this transmission.
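The role-token bottleneck can be quantified by summing the attention mass a generated token places on each token group. A minimal sketch with a toy, made-up attention row (the token layout and all values are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical token layout for one generated token's attention row:
# indices 0..575 = image tokens, 576..579 = role tokens ("ASSISTANT", ":", ...),
# remaining 20 = other text tokens. All values are synthetic.
n_img, n_role, n_txt = 576, 4, 20
rng = np.random.default_rng(0)
row = rng.uniform(0, 1e-4, n_img + n_role + n_txt)
row[n_img:n_img + n_role] = 0.2   # concentrate mass on role tokens
row /= row.sum()                  # attention rows sum to 1

# Aggregate attention mass per token group
groups = {
    "image": row[:n_img].sum(),
    "role":  row[n_img:n_img + n_role].sum(),
    "text":  row[n_img + n_role:].sum(),
}
dominant = max(groups, key=groups.get)
print(dominant)  # role tokens carry most of the mass in this toy setup
```

In a real measurement, `row` would come from the model's attention tensors (e.g. with attention outputs enabled during generation), aggregated over layers and heads.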
Localization head
Localization heads are automatically selected by two criteria: the attention sum over image tokens and the spatial entropy of the attention map. Simply combining the attention maps of these heads and converting them into masks/boxes predicts visual grounding accurately without any training. This implies that LVLMs inherently understand language-image relationships, and that some attention heads implicitly encode visual location information.
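The selection-and-grounding step can be sketched as follows. This is a simplified numpy sketch with synthetic attention maps: the paper's spatial entropy is computed over connected components of the binarized map, whereas here plain Shannon entropy of the normalized map is used as a proxy, and all sizes, thresholds, and function names are assumptions.

```python
import numpy as np

def head_scores(attn_maps, eps=1e-8):
    """Score per-head image-attention maps of shape (H, G, G) by
    (1) total attention mass on image tokens and (2) spatial entropy.
    Proxy: Shannon entropy of the normalized map (low = concentrated)."""
    flat = attn_maps.reshape(len(attn_maps), -1)
    sums = flat.sum(axis=1)
    p = flat / (flat.sum(axis=1, keepdims=True) + eps)
    ent = -(p * np.log(p + eps)).sum(axis=1)
    return sums, ent

def select_localization_heads(attn_maps, k=3):
    """Keep heads with high attention sum, then pick the k with the
    lowest (most spatially concentrated) entropy among them."""
    sums, ent = head_scores(attn_maps)
    top = np.argsort(-sums)[: max(k * 3, k)]
    return top[np.argsort(ent[top])][:k]

def attention_to_box(attn_map, thresh=0.5):
    """Binarize a combined attention map and return the bounding box
    (row0, col0, row1, col1) of cells above thresh * max."""
    mask = attn_map >= thresh * attn_map.max()
    rows, cols = np.where(mask)
    return rows.min(), cols.min(), rows.max() + 1, cols.max() + 1

# Synthetic demo: 8 heads over a 16x16 image-token grid; head 2 attends
# sharply to a 3x3 object region, the rest are diffuse noise.
rng = np.random.default_rng(0)
attn = rng.uniform(0, 0.01, size=(8, 16, 16))
attn[2, 5:8, 5:8] = 1.0
heads = select_localization_heads(attn, k=1)
combined = attn[heads].mean(axis=0)
box = attention_to_box(combined, thresh=0.5)
```

With real model attentions, `attn_maps` would be the rows from generated text tokens to image-token positions, reshaped to the vision grid; the training-free aspect is that nothing here is learned.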

Seonglae Cho