AI Multimodal Reasoning

AI Spatial Reasoning Methods

Multimodal large language models (MLLMs) score below 50% accuracy when asked to visually recognize, or systematically count the edges of, even simple regular polygons. The paper attributes this to a 'shape-blind' vision encoder that cannot distinguish rare shapes. The models rely on intuition and memorization (System 1 thinking) rather than logical step-by-step reasoning (System 2 thinking). However, with Visually-Cued CoT prompts, which label each shape's edges with numbers or characters and guide the model step by step, GPT-4v's accuracy in counting the edges of irregular polygons improves dramatically, from 7% to 93%.
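As a rough illustration of the prompting idea described above, the following is a minimal sketch of how a visually-cued, step-by-step prompt might be assembled. It assumes the image has already been annotated so that each edge carries a distinct label. The function names, the prompt wording, and the deduplication helper are hypothetical and are not taken from the paper.

```python
def build_vc_cot_prompt(label_kind: str = "letter") -> str:
    """Assemble a step-by-step prompt that points the model at the edge labels.

    Hypothetical wording: the exact prompt used in the paper is not given here.
    """
    steps = [
        f"Each edge of the shape in the image is marked with a distinct {label_kind}.",
        f"Step 1: List every {label_kind} you can see on the shape, one by one.",
        f"Step 2: Count the {label_kind}s you listed.",
        "Step 3: State the number of edges, which equals that count.",
    ]
    return "\n".join(steps)


def count_from_listed_labels(listed: list[str]) -> int:
    """Derive the edge count from the labels a model reported in Step 1.

    Labels are deduplicated so a label the model repeats is not counted twice.
    """
    return len(set(listed))


if __name__ == "__main__":
    print(build_vc_cot_prompt())
    # Example with a made-up model response that repeats one label.
    print(count_from_listed_labels(["A", "B", "C", "A"]))
```

The point of the sketch is that the count is derived from explicitly enumerated visual cues rather than from the model's overall impression of the shape, which is the shift from System 1 to System 2 behavior that the summary describes.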

arxiv.org

https://arxiv.org/pdf/2502.15969

Spatial reasoning platform | University of Surrey

https://www.surrey.ac.uk/spatial-reasoning
