FastV

Creator
Seonglae Cho
Created
2024 Oct 26 13:32
Edited
2024 Oct 26 13:51

FastV dynamically prunes unnecessary visual tokens during inference in a vision-language model.

Because visual signals are highly redundant, self-attention in the shallow layers aggregates image-related, instruction-specific features onto a few “anchor” tokens. Notably, these anchor tokens are not image tokens. In the deep layers, attention concentrates on those anchor tokens, so the image tokens themselves receive significantly less attention. This is what makes many image tokens candidates for pruning once the shallow layers have run.
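The pruning idea can be illustrated with a small sketch. It assumes image tokens are ranked by the mean attention they receive at some chosen layer and that only the top fraction is kept; this note does not state the exact criterion, so treat that rule as an assumption. The function name, its parameters, and the toy attention matrix are hypothetical and are not taken from the FastV codebase.

```python
def prune_visual_tokens(attn, image_start, image_end, keep_ratio):
    """Return the indices of the tokens to keep, in their original order.

    attn[q][k] is the (head-averaged) attention weight from query q to key k.
    Image tokens occupy positions image_start .. image_end - 1.
    All names here are illustrative assumptions, not a real library API.
    """
    seq_len = len(attn)
    # Score each image token by the mean attention it receives from all queries.
    scores = {
        k: sum(attn[q][k] for q in range(seq_len)) / seq_len
        for k in range(image_start, image_end)
    }
    num_keep = max(1, round(keep_ratio * (image_end - image_start)))
    top = set(sorted(scores, key=scores.get, reverse=True)[:num_keep])
    # Non-image tokens (including any anchor tokens) are always kept.
    return [
        i for i in range(seq_len)
        if not (image_start <= i < image_end) or i in top
    ]


# Toy causal attention: token 0 acts as an anchor, tokens 1-4 are image
# tokens, token 5 is an instruction token. Each row sums to 1.
attn = [
    [1.0, 0.0,  0.0,  0.0,  0.0,  0.0],
    [0.6, 0.4,  0.0,  0.0,  0.0,  0.0],
    [0.5, 0.1,  0.4,  0.0,  0.0,  0.0],
    [0.5, 0.05, 0.05, 0.4,  0.0,  0.0],
    [0.5, 0.02, 0.2,  0.03, 0.25, 0.0],
    [0.6, 0.01, 0.15, 0.02, 0.12, 0.1],
]

kept = prune_visual_tokens(attn, image_start=1, image_end=5, keep_ratio=0.5)
print(kept)  # [0, 1, 2, 5]
```

In this toy matrix the anchor token at position 0 absorbs most of the attention, and the two least-attended image tokens (positions 3 and 4) are dropped, which mirrors the behavior described above.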
