Existing benchmarks (MMLU, GSM8K, etc.) target "problems that are difficult for humans," whereas GAIA focuses on "real-world tasks that are easy for humans but challenging for AI."
GAIA benchmark
Created: 2025 Feb 9 16:29
Edited: 2025 Oct 15 23:52