This paper introduces a new benchmark that tests whether LLMs can read an entire paper and find its errors. In its experiments, nearly all current models fail at the task, indicating that they have virtually no capability for full-paper verification. RAG provides almost no help.
openreview.net
https://openreview.net/pdf?id=GDA1yB6yDP

Seonglae Cho