SoM Prompting

Creator

Seonglae Cho

Created

2025 Nov 12 15:42

Editor

Seonglae Cho

Edited

2026 May 15 17:46

Set-of-Mark (SoM) prompting overlays “marks” (numbers, letters, boxes, etc.) on each DOM element, object, or region of a webpage or image, so that a multimodal model like GPT-4V can achieve better spatial grounding by referring to a mark instead of describing a location.
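The bookkeeping side of this idea can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the `Region` class, the helper names, and the example boxes are all hypothetical, and actually drawing the marks onto the screenshot (with an imaging library) is left out. It shows how regions get numbered and how a mark the model answers with is mapped back to pixel coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical region to mark: the bounding box of a DOM element or object."""
    name: str
    x: int  # left edge, pixels
    y: int  # top edge, pixels
    w: int  # width, pixels
    h: int  # height, pixels


def assign_marks(regions):
    """Number the regions from 1 in reading order (top to bottom, then left to right)."""
    ordered = sorted(regions, key=lambda r: (r.y, r.x))
    return {i: r for i, r in enumerate(ordered, start=1)}


def label_anchor(region):
    """Where the mark label would be drawn: here, the box's top-left corner."""
    return (region.x, region.y)


def resolve_mark(marks, mark_id):
    """Map a mark the model answered with back to the box center (e.g. a click target)."""
    r = marks[mark_id]
    return (r.x + r.w // 2, r.y + r.h // 2)


# Made-up example boxes, purely for illustration.
regions = [
    Region("first result", x=40, y=100, w=400, h=60),
    Region("submit button", x=360, y=20, w=80, h=30),
    Region("search box", x=40, y=20, w=300, h=30),
]
marks = assign_marks(regions)
```

With the marked image in hand, the prompt can ask something like "Which mark is the submit button?", and the model's answer is passed to `resolve_mark` to recover where that element sits on the page.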

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we...

https://arxiv.org/abs/2310.11441
