*Do LLMs learn to reason, or are they just memorizing?*🤔 We investigate LLM memorization in logical reasoning with a local inconsistency-based memorization score and a dynamically generated Knights & Knaves (K&K) puzzle benchmark. 🌐: https://t.co/TxqA5sxSuv (1/n) https://t.co/HnhxUdsoSr
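The post names two ingredients: dynamically generated Knights & Knaves puzzles and a "local inconsistency-based" memorization score. Below is a minimal sketch of how such a setup could look, assuming the score means "solves a puzzle correctly but breaks under a small local perturbation of it"; the helper `ask_llm` is a hypothetical stand-in for a model call, not the paper's actual API.

```python
"""Sketch only: generate Knights & Knaves puzzles with a unique solution and
estimate a perturbation-based memorization score (assumed reading of the
'local inconsistency' idea in the post)."""
import itertools
import random


def brute_force_solutions(n, statements):
    """All role assignments (True = knight) consistent with the statements.
    Each statement is (speaker, target, claims_target_is_knight)."""
    solutions = []
    for roles in itertools.product([True, False], repeat=n):
        ok = True
        for speaker, target, claims_knight in statements:
            statement_true = roles[target] == claims_knight
            # Knights always tell the truth, knaves always lie.
            if roles[speaker] != statement_true:
                ok = False
                break
        if ok:
            solutions.append(roles)
    return solutions


def generate_puzzle(n=3, rng=random):
    """Sample statements until the puzzle has exactly one solution."""
    while True:
        statements = []
        for speaker in range(n):
            target = rng.randrange(n)
            while target == speaker:
                target = rng.randrange(n)
            statements.append((speaker, target, rng.random() < 0.5))
        sols = brute_force_solutions(n, statements)
        if len(sols) == 1:
            return statements, sols[0]


def perturb(statements, rng=random):
    """Local perturbation: flip the claim in one randomly chosen statement."""
    i = rng.randrange(len(statements))
    speaker, target, claim = statements[i]
    out = list(statements)
    out[i] = (speaker, target, not claim)
    return out


def memorization_score(puzzles, ask_llm):
    """Fraction of correctly solved puzzles the model gets wrong after a
    local perturbation (hypothetical `ask_llm(statements) -> roles`)."""
    memorized, graded = 0, 0
    for statements, answer in puzzles:
        if ask_llm(statements) != answer:
            continue  # only correctly solved puzzles can count as memorized
        graded += 1
        perturbed = perturb(statements)
        sols = brute_force_solutions(len(answer), perturbed)
        if len(sols) == 1 and ask_llm(perturbed) != sols[0]:
            memorized += 1
    return memorized / max(graded, 1)
```

Because the puzzles are generated on the fly, a fresh test set can be drawn at any time, which is what makes the benchmark hard to contaminate via training data.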
My intuitive sense is that LLMs are thinking, but imperfectly. Obviously, there are many types of evals where they fail at even basic tasks. This is one way to evaluate thinking, and on that metric they certainly appear to be really amazing stochastic parrots. However,…
LLMs struggle with basic human-like reflection abilities

Reflection-Bench tests if LLMs can actually learn from their mistakes like humans do

Original Problem 🤔: Current benchmarks for evaluating LLM intelligence lack a unified, biologically-inspired framework. Most… https://t.co/RZTzZXoAph
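The core idea, "can the model learn from its mistakes", can be illustrated with a simple two-turn loop: ask, show the model its own wrong answer plus corrective feedback, and ask again. This is a minimal sketch of that idea, not the actual Reflection-Bench protocol; `ask_llm` is again a hypothetical model call.

```python
def reflection_gain(items, ask_llm):
    """items: list of (question, correct_answer, feedback) triples.
    Returns (first-try accuracy, recovery rate after feedback on misses)."""
    first_correct, retried, recovered = 0, 0, 0
    for question, answer, feedback in items:
        guess = ask_llm(question)
        if guess == answer:
            first_correct += 1
            continue
        retried += 1
        retry_prompt = (
            f"{question}\n"
            f"Your previous answer was: {guess}\n"
            f"Feedback: {feedback}\n"
            f"Please answer again."
        )
        if ask_llm(retry_prompt) == answer:
            recovered += 1
    return first_correct / max(len(items), 1), recovered / max(retried, 1)
```

A model that genuinely reflects should show a recovery rate well above its first-try accuracy; one that ignores feedback will not.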
OpenAI has introduced SimpleQA, a new benchmark designed to evaluate the factual accuracy of large language models (LLMs) by testing their ability to answer short, fact-seeking questions. This initiative aims to enhance the reliability of LLMs in providing accurate information. In a related development, a benchmark called SimpleBench has revealed that LLMs still fall well short of human performance on simple reasoning questions, with human participants scoring 84% compared to the best LLM tested, which scored only 42%. Another benchmark, Reflection-Bench, assesses whether LLMs can learn from their mistakes, highlighting the current limitations in evaluating LLM intelligence. These benchmarks reflect ongoing efforts to improve LLM performance and to better understand their reasoning capabilities.