It dynamically prunes unnecessary visual tokens.
The high redundancy of visual signals causes image-related, instruction-specific features to aggregate onto a few “anchor” tokens through the self-attention mechanism in the shallow layers. Notably, these anchor tokens are not image tokens. In the deep layers, attention concentrates on these anchor tokens, so the attention allocated to the image tokens themselves drops significantly, which is what makes many of them safe to prune.
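As an illustration only (a minimal sketch under stated assumptions, not the actual pruning procedure), one way to act on this observation is to rank image tokens by the attention they receive in a given layer and keep only the most-attended ones. The tensor names (`hidden_states`, `attn`), the contiguous image span `[img_start, img_end)`, and the `keep_ratio` hyperparameter are all hypothetical and introduced for illustration.

```python
import torch

def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn: torch.Tensor,
                        img_start: int,
                        img_end: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the image tokens that receive the most attention.

    Assumptions (hypothetical, for illustration):
      hidden_states: (batch, seq_len, hidden_dim) activations of one layer
      attn:          (batch, heads, seq_len, seq_len) attention weights
      image tokens occupy the contiguous span [img_start, img_end)
    """
    # Average over heads and over all query positions, then restrict to the
    # columns of image tokens: how much attention each image token receives.
    received = attn.mean(dim=1).mean(dim=1)[:, img_start:img_end]  # (batch, num_img)

    num_img = img_end - img_start
    num_keep = max(1, int(num_img * keep_ratio))

    # Indices (within the image span) of the most-attended image tokens,
    # sorted to preserve the original token order.
    topk = received.topk(num_keep, dim=-1).indices.sort(dim=-1).values

    kept = []
    for b in range(hidden_states.size(0)):
        prefix = hidden_states[b, :img_start]            # text before the image
        image = hidden_states[b, img_start + topk[b]]    # surviving image tokens
        suffix = hidden_states[b, img_end:]              # text after the image
        kept.append(torch.cat([prefix, image, suffix], dim=0))
    return torch.stack(kept, dim=0)
```

In this sketch the pruning decision is made once, from a single layer's attention map; a dynamic variant could recompute the ranking per layer or per input, consistent with the observation that image tokens matter mostly in the shallow layers.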