puzzle
EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal...
https://arxiv.org/abs/2502.08859