PhysBench

Creator
Seonglae Cho
Created
2026 Mar 23 0:58
Edited
2026 Mar 23 1:34
Refs
PhysBench consists of 10,002 interleaved video-image-text data points, divided into 4 main domains (physical object properties, object relationships, scene understanding, and physics-based dynamics), 19 subcategories, and 8 capability dimensions. Data collection includes web searches, simulations, and real-world footage, with a total of 4,000 hours of manual annotation. Each question is multiple-choice with four options and one correct answer.
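The multiple-choice format above lends itself to a simple exact-match evaluation. A minimal sketch, assuming a hypothetical item schema (field names like `media`, `options`, and `answer` are illustrative, not PhysBench's actual format):

```python
from dataclasses import dataclass

@dataclass
class PhysBenchItem:
    """Hypothetical schema for one PhysBench data point (field names assumed)."""
    media: list[str]      # interleaved video/image references
    question: str
    options: list[str]    # four answer choices
    answer: str           # the single correct option label, e.g. "B"
    domain: str           # one of the 4 main domains
    subcategory: str      # one of the 19 subcategories

def accuracy(predictions: list[str], items: list[PhysBenchItem]) -> float:
    """Fraction of multiple-choice predictions matching the ground truth."""
    correct = sum(p == item.answer for p, item in zip(predictions, items))
    return correct / len(items)

item = PhysBenchItem(
    media=["clip.mp4"], question="Which object is heavier?",
    options=["A", "B", "C", "D"], answer="B",
    domain="physical object properties", subcategory="mass",
)
print(accuracy(["B"], [item]))  # → 1.0
```

With one correct answer out of four options, random guessing yields 25%, which puts the ~40% average VLM accuracy reported below in context.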
Large-scale experiments on 75 representative VLMs showed that most models achieved only about 40% accuracy on average, with the best-performing model GPT-4o reaching 49.49%, far below human-level performance (95.87%). Closed-source models significantly outperformed open-source models, with GPT-4o exceeding the top open-source model LLaVA-interleave by 20.7%. Performance was particularly low in physical scene understanding and dynamics.
An interesting finding is that scaling model size, training data volume, or input frame count does not by itself improve PhysBench performance.

PhysAgent

  1. Task-specific Prompt Activation - classifies the question and retrieves relevant physical knowledge from a knowledge memory to include in the prompt
  2. Foundation Models Integration - leverages outputs from vision foundation models such as Depth Anything, SAM, and GroundingDINO (depth estimation, segmentation, and object detection) to shore up perceptual capabilities where VLMs are weak
  3. Chain-of-Thought Reasoning - performs step-by-step reasoning and self-verification to ensure logical consistency
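The three stages above can be sketched as a prompt-assembly pipeline. Everything here is an illustrative stand-in, not the authors' implementation: the knowledge memory, the keyword-based task classifier, and the foundation-model wrappers are all assumed names.

```python
# Hedged sketch of a PhysAgent-style pipeline; all names are hypothetical.
PHYSICS_MEMORY = {
    "dynamics": "Objects in free fall accelerate at ~9.8 m/s^2.",
    "properties": "Denser materials sink in less dense fluids.",
}

def classify_task(question: str) -> str:
    """Stage 1: crude keyword router standing in for the task classifier."""
    q = question.lower()
    return "dynamics" if ("fall" in q or "collide" in q) else "properties"

def foundation_model_hints(media_path: str) -> dict:
    """Stage 2: placeholder for Depth Anything / SAM / GroundingDINO outputs."""
    return {"depth_map": f"depth({media_path})", "masks": f"sam({media_path})"}

def build_prompt(question: str, media_path: str) -> str:
    """Assemble the prompt: retrieved knowledge (Stage 1), visual cues
    (Stage 2), and a chain-of-thought instruction with self-verification
    (Stage 3)."""
    knowledge = PHYSICS_MEMORY[classify_task(question)]
    hints = foundation_model_hints(media_path)
    return (
        f"Relevant physics: {knowledge}\n"
        f"Visual cues: {hints}\n"
        f"Question: {question}\n"
        "Think step by step, then verify your answer before responding."
    )

print(build_prompt("Will the ball fall faster than the feather?", "clip.mp4"))
```

The design point is that the VLM never has to infer depth or object boundaries itself; those are injected as text from specialist vision models, and the chain-of-thought instruction is appended last.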
