Recent research highlights significant challenges faced by large language models (LLMs) in reasoning over long contexts. Despite advances in retrieval capabilities, LLMs struggle with compositional reasoning tasks. For instance, the NoCha benchmark, which asks models to verify claims about fictional books, shows that none of the 11 tested LLMs, including GPT-4o, reaches human-level performance: the best model scores only 55.8%, versus a human baseline of 97%. Part of this gap is attributed to a U-shaped positional attention bias, whereby models attend more reliably to information at the beginning and end of a long context than in the middle, which skews their generation behavior. Further studies using needle-in-a-haystack probes and a written version of the Comprehension Challenge find that LLMs still perform at or below random chance on complex reasoning tasks, despite their strong performance on synthetic benchmarks.
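To make the needle-in-a-haystack protocol mentioned above concrete, here is a minimal sketch of how such a probe is typically constructed: a single retrievable fact (the "needle") is inserted at varying depths into a long filler context, and the model is asked to recover it. The `query_model` wrapper, the filler/needle strings, and the depth grid are illustrative assumptions, not the setup of any specific benchmark cited here.

```python
# Minimal needle-in-a-haystack sketch. Assumes a generic
# query_model(prompt: str) -> str wrapper around the LLM under test
# (the wrapper, filler text, and needle are hypothetical placeholders).

FILLER = "The sky was clear and the market stayed quiet all afternoon. "
NEEDLE = "The secret passcode is 7341."
QUESTION = "What is the secret passcode mentioned in the text? Answer with the number only."


def build_context(total_sentences: int, needle_depth: float) -> str:
    """Build a long filler context with the needle inserted at a relative
    depth (0.0 = start of context, 1.0 = end of context)."""
    sentences = [FILLER] * total_sentences
    insert_at = int(needle_depth * total_sentences)
    sentences.insert(insert_at, NEEDLE + " ")
    return "".join(sentences)


def run_probe(query_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0), total_sentences=2000):
    """Score retrieval success at several needle depths; a U-shaped
    accuracy curve over depth is the positional bias described above."""
    results = {}
    for depth in depths:
        prompt = build_context(total_sentences, depth) + "\n\n" + QUESTION
        answer = query_model(prompt)
        results[depth] = "7341" in answer
    return results


if __name__ == "__main__":
    # Dummy stand-in model that always fails, shown only to illustrate the call shape.
    print(run_probe(lambda prompt: "I don't know"))
```

Retrieval probes like this are the "synthetic benchmarks" on which current models do well; the compositional claim-verification and comprehension tasks described above are where performance collapses.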
Wondering how LLMs do on the Comprehension Challenge I proposed in 2014? New results from an easier (written not visual) version of that test: “no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks)” https://t.co/0YXkbIRWpt
🤔 Do you think LLM reasoning is solved? 🏆 High leaderboard numbers may not tell the whole story! Check out our new paper investigating the robustness of LLMs in reasoning! 🧠 https://t.co/sspCqWtf6c