Language-conditioned object detection/segmentation
The finding that vision-language models (VLMs) use visual space as a content-independent scaffold, one that functions like an abstract symbolic variable, offers a new direction for diagnosing the causes of visual grounding failures and for designing future VLMs.
Visual symbolic mechanisms: Emergent symbol processing in Vision...
To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red...
https://openreview.net/forum?id=3RQ863cRbx

Seonglae Cho