GPT-4 models have demonstrated performance at or above human levels on certain Theory of Mind (ToM) tasks, such as identifying indirect requests, false beliefs, and misdirection, but have struggled with detecting faux pas. Notably, GPT-4 surpassed adult human performance on 6th-order ToM inferences, suggesting that increased model size, instruction fine-tuning, multimodal capabilities, and word comprehension contribute to its ability to model mental states. Despite these results, benchmarking LLMs' intelligence remains inconsistent: some tests show a significant gap between GPT-4 and human performance, potentially due to differences in testing structure or prompting methods.
🧠 "Thinking at a Distance" in the Age of AI — LLMs, with their vast corpora and speed, redefine the essence of cognition. The extraordinary rise of large language models (LLMs) has exposed a curious split between human and artificial intelligence when it comes to processing… https://t.co/vT1AkHkfyf
Using #ChatGPT in the Development of Clinical Reasoning Cases: A Qualitative Study https://t.co/PrUhrGZJky
LLMs' "intelligence" is hard to benchmark, as we don't have good benchmarks for human performance at complex tasks. Take theory of mind: several tests found GPT-4 beats humans, but another finds a huge gap. Is it the testing structure? Prompting? Which is right? Hard to know. https://t.co/z9L3stRCDP