
Recent research indicates that the large language models (LLMs) GPT-4 and Flan-PaLM have achieved adult-level and near adult-level performance, respectively, on Theory of Mind (ToM) tasks. Notably, GPT-4 exceeds adult human performance on 6th-order inferences. The study, published on arXiv on May 29th, involved 1,440 data points, though some results appeared noisy given the small number of questions per condition. The findings highlight the potential of LLMs on complex cognitive tasks, although the reliability of benchmarks for such tasks remains a topic of debate.
LLMs’ “intelligence” is hard to benchmark, as we don’t have good benchmarks for human performance on complex tasks. Take theory of mind: several tests find GPT-4 beats humans, but another finds a huge gap. Is it the testing structure? The prompting? Which is right? Hard to know. https://t.co/z9L3stRCDP
⁉️ Let's check how GPT-4o, Gemini, Llama3, Mixtral, and Claude perform on theory of mind, shall we? 🌟 We report new results on the FANToM benchmark 👻 - GPT-4o tops the chart, finally achieving a score of 2.0/100 (vs. human 87.5) - Huge boost for Gemini-1.5-flash compared to… https://t.co/rtuNsfEyIF https://t.co/GZWidSdIYs
LLMs achieve adult human performance on higher-order theory of mind tasks: GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and GPT-4 exceeds adult performance on 6th-order inferences. https://t.co/SsKbu4PbCo
