- On a webpage or image, attach “marks” (numbers, letters, boxes, etc.) to each DOM element, object, or region so that a multimodal model like GPT-4V can achieve better spatial grounding.
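
The marking step can be sketched roughly as follows. This is a minimal illustration only, not the paper's implementation: it assumes the region boxes are already available (for example from a segmentation model or from DOM bounding rectangles), and the names `Mark`, `assign_marks`, `render_svg_overlay`, and `build_prompt` are hypothetical helpers made up for this note. It assigns a number to each region, draws the numbers as an SVG overlay, and builds a text prompt that refers to the marks.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box format for this sketch: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Mark:
    label: str  # the visible mark, e.g. "1", "2", ...
    box: Box    # the region this mark refers to


def assign_marks(boxes: List[Box]) -> List[Mark]:
    """Number the regions in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [Mark(label=str(i), box=b) for i, b in enumerate(ordered, start=1)]


def render_svg_overlay(marks: List[Mark], width: int, height: int) -> str:
    """Build an SVG layer that outlines each region and prints its mark near the top-left corner."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    ]
    for m in marks:
        x, y, w, h = m.box
        parts.append(
            f'<rect x="{x}" y="{y}" width="{w}" height="{h}" '
            f'fill="none" stroke="red" stroke-width="2"/>'
        )
        parts.append(
            f'<text x="{x + 4}" y="{y + 16}" font-size="14" fill="red">{m.label}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


def build_prompt(marks: List[Mark], question: str) -> str:
    """Compose a text prompt that tells the model which marks appear in the image."""
    labels = ", ".join(m.label for m in marks)
    return (
        f"The image is annotated with numbered marks ({labels}). "
        f"{question} Answer with the mark number."
    )


# Example: three hypothetical regions on a 320x200 screenshot.
boxes = [(200, 40, 80, 30), (20, 40, 120, 30), (20, 100, 260, 60)]
marks = assign_marks(boxes)
overlay = render_svg_overlay(marks, 320, 200)
prompt = build_prompt(marks, "Which element is the search button?")
```

The overlay would then be composited onto the screenshot or image before sending it to the model together with `prompt`. Because the model answers with a mark number, the answer maps back to a concrete region through `marks`.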
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we...
https://arxiv.org/abs/2310.11441


Seonglae Cho