PhysBench

Creator
Seonglae Cho
Created
2026 Mar 23 0:58
Edited
2026 Mar 23 1:34
Refs
PhysBench consists of 10,002 interleaved video-image-text data points, divided into 4 main domains (physical object properties, object relationships, scene understanding, and physics-based dynamics), 19 subcategories, and 8 capability dimensions. Data collection includes web searches, simulations, and real-world footage, with a total of 4,000 hours of manual annotation. Each question is multiple-choice with four options and one correct answer.
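The multiple-choice format above lends itself to a simple exact-match evaluation. A minimal sketch, assuming a hypothetical item schema (field names like `media`, `options`, and `answer` are illustrative, not PhysBench's actual format):

```python
from dataclasses import dataclass

@dataclass
class PhysBenchItem:
    """Hypothetical schema for one PhysBench data point (field names assumed)."""
    media: list[str]      # interleaved video/image references
    question: str
    options: list[str]    # four answer choices
    answer: str           # the single correct option label, e.g. "B"
    domain: str           # one of the 4 main domains
    subcategory: str      # one of the 19 subcategories

def accuracy(predictions: list[str], items: list[PhysBenchItem]) -> float:
    """Fraction of multiple-choice predictions matching the ground truth."""
    correct = sum(p == item.answer for p, item in zip(predictions, items))
    return correct / len(items)

item = PhysBenchItem(
    media=["clip.mp4"], question="Which object is heavier?",
    options=["A", "B", "C", "D"], answer="B",
    domain="physical object properties", subcategory="mass",
)
print(accuracy(["B"], [item]))  # → 1.0
```

With one correct answer out of four options, random guessing yields 25%, which puts the ~40% average VLM accuracy reported below in context.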
Large-scale experiments on 75 representative VLMs showed that most models achieved only about 40% accuracy on average, with the best-performing model GPT-4o reaching 49.49%, far below human-level performance (95.87%). Closed-source models significantly outperformed open-source models, with GPT-4o exceeding the top open-source model LLaVA-interleave by 20.7%. Performance was particularly low in physical scene understanding and dynamics.
An interesting finding is that scaling model size, training data volume, or input frame count does not by itself improve PhysBench performance.

PhysAgent

  1. Task-specific Prompt Activation - classifies the question and retrieves relevant physical knowledge from a knowledge memory to include in the prompt
  2. Foundation Models Integration - leverages outputs from vision foundation models such as Depth Anything, SAM, and GroundingDINO (depth estimation, segmentation, and object detection) to shore up perceptual capabilities where VLMs are weak
  3. Chain-of-Thought Reasoning - performs step-by-step reasoning and self-verification to ensure logical consistency
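The three stages above can be sketched as a prompt-assembly pipeline. Everything here is an illustrative stand-in, not the authors' implementation: the knowledge memory, the keyword-based task classifier, and the foundation-model wrappers are all assumed names.

```python
# Hedged sketch of a PhysAgent-style pipeline; all names are hypothetical.
PHYSICS_MEMORY = {
    "dynamics": "Objects in free fall accelerate at ~9.8 m/s^2.",
    "properties": "Denser materials sink in less dense fluids.",
}

def classify_task(question: str) -> str:
    """Stage 1: crude keyword router standing in for the task classifier."""
    q = question.lower()
    return "dynamics" if ("fall" in q or "collide" in q) else "properties"

def foundation_model_hints(media_path: str) -> dict:
    """Stage 2: placeholder for Depth Anything / SAM / GroundingDINO outputs."""
    return {"depth_map": f"depth({media_path})", "masks": f"sam({media_path})"}

def build_prompt(question: str, media_path: str) -> str:
    """Assemble the prompt: retrieved knowledge (Stage 1), visual cues
    (Stage 2), and a chain-of-thought instruction with self-verification
    (Stage 3)."""
    knowledge = PHYSICS_MEMORY[classify_task(question)]
    hints = foundation_model_hints(media_path)
    return (
        f"Relevant physics: {knowledge}\n"
        f"Visual cues: {hints}\n"
        f"Question: {question}\n"
        "Think step by step, then verify your answer before responding."
    )

print(build_prompt("Will the ball fall faster than the feather?", "clip.mp4"))
```

The design point is that the VLM never has to infer depth or object boundaries itself; those are injected as text from specialist vision models, and the chain-of-thought instruction is appended last.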
