Existing benchmarks such as MMLU and GSM8K target "problems that are difficult for humans," whereas GAIA focuses on "real-world tasks that are easy for humans but challenging for AI."
gaia-benchmark (GAIA)
Benchmarking General AI Agents
https://huggingface.co/gaia-benchmark
GAIA: a benchmark for General AI Assistants (arXiv:2311.12983)
https://arxiv.org/pdf/2311.12983

Seonglae Cho