AI Multimodal Reasoning

Creator
Seonglae Cho
Created
2025 Mar 20 10:12
Edited
2025 Apr 21 16:35
Refs
AI Spatial Reasoning Methods
 
 
 
 
MLLMs achieve less than 50% accuracy when visually recognizing or systematically counting the edges of even simple regular polygons. The cause is the vision encoder's "shape-blind" phenomenon, which prevents it from distinguishing rare shapes. The models rely only on intuition and memorization (System 1 Thinking) without performing logical step-by-step reasoning (System 2 Thinking). However, with Visually-Cued CoT prompts, which label each edge of a shape with a number or character and guide the model step by step, GPT-4v's accuracy in counting the edges of irregular polygons improves dramatically from 7% to 93%.
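The idea of a visually-cued prompt can be illustrated with a minimal sketch. It is not the original authors' implementation: the function names (`polygon_vertices`, `label_edges`, `to_svg`, `build_prompt`), the SVG rendering, and the prompt wording are all assumptions made for illustration. The sketch draws a polygon, places a letter at the midpoint of every edge, and pairs the image with a step-by-step instruction, so that counting edges reduces to reading and counting labels.

```python
import math
import string


def polygon_vertices(n_sides, radius=100.0, center=(150.0, 150.0)):
    """Vertices of a regular n-gon; an irregular shape would supply its own points."""
    cx, cy = center
    return [
        (cx + radius * math.cos(2 * math.pi * k / n_sides),
         cy + radius * math.sin(2 * math.pi * k / n_sides))
        for k in range(n_sides)
    ]


def label_edges(vertices):
    """Pair each edge with a letter placed at the edge midpoint (the visual cue)."""
    n = len(vertices)
    if n > len(string.ascii_uppercase):
        raise ValueError("this sketch only supports up to 26 edges")
    labelled = []
    for k in range(n):
        (x1, y1), (x2, y2) = vertices[k], vertices[(k + 1) % n]
        midpoint = ((x1 + x2) / 2, (y1 + y2) / 2)
        labelled.append((string.ascii_uppercase[k], midpoint))
    return labelled


def to_svg(vertices, labelled):
    """Render the outline plus one text label per edge as an SVG string."""
    points = " ".join(f"{x:.1f},{y:.1f}" for x, y in vertices)
    texts = "".join(
        f'<text x="{x:.1f}" y="{y:.1f}" fill="red">{name}</text>'
        for name, (x, y) in labelled
    )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="300" height="300">'
        f'<polygon points="{points}" fill="none" stroke="black"/>{texts}</svg>'
    )


def build_prompt():
    """Hypothetical step-by-step instruction that accompanies the labelled image."""
    return (
        "Each edge of the shape in the image is marked with a letter.\n"
        "Step 1: List every letter you can see, one per line.\n"
        "Step 2: Count the letters you listed.\n"
        "Step 3: State the number of edges, which equals that count."
    )


vertices = polygon_vertices(7)
labelled = label_edges(vertices)
svg_image = to_svg(vertices, labelled)
prompt = build_prompt()
```

The resulting `svg_image` and `prompt` would be sent together to a multimodal model. How the image is rasterized and which API is called are left out, since those details depend on the model being tested.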
 
 

 
