Animal-AI
Benchmarks
This paper highlights the limitations of evaluating an LLM’s physical commonsense reasoning using only static text or image benchmarks. Such benchmarks lack ecological and construct validity, making it hard to tell whether the model truly understands causal relationships in the physical world.
A little less conversation, a little more action, please:...
As general-purpose tools, Large Language Models (LLMs) must often reason about everyday physical environments. In a question-and-answer capacity, understanding the interactions of physical objects...
https://arxiv.org/abs/2410.23242

Physical understanding benchmark and simulator
allenai/ai2thor
https://arxiv.org/pdf/2410.23242
AI2-THOR
Open Source Interactive Environments for Embodied AI
https://ai2thor.allenai.org/


Seonglae Cho