GAIA benchmark

Created

2025 Feb 9 16:29

Creator

Seonglae Cho

Editor

Seonglae Cho

Edited

2025 Oct 15 23:52

Refs

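Most existing benchmarks (MMLU, GSM8K, etc.) target "problems that are difficult for humans," whereas GAIA focuses on "real-world tasks that are easy for humans but challenging for AI." Its questions typically require a combination of reasoning, web browsing, multi-modality handling, and tool use.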

gaia-benchmark (GAIA)

Benchmarking General AI Agents

https://huggingface.co/gaia-benchmark

GAIA: a benchmark for General AI Assistants (arxiv.org)

https://arxiv.org/pdf/2311.12983
