Really excited to share our first paper: "On the Evaluation of Engineering Artificial General Intelligence" --> a paradigm shift has occurred in the AI field over the past year: the focus has shifted from algorithms to ever more complex and diverse evals and RL environments --> https://t.co/WFXHXI1NZw
ARC-AGI-2 paper is out! Here are the new principles of the challenge: • Requires multi-rule, multi-step, and contextual reasoning. • Grids are larger, contain more objects, and encode multiple interacting concepts. • Tasks are novel and not reusable to limit memorization. https://t.co/3X1FBOdvzc
ARC v2 paper is officially out! We tested v2 in a controlled setting with over 400 humans; this report contains details and analysis to substantiate our claim that v2 is relatively "easy for humans, hard for AI". We'll be releasing the raw data later this week. https://t.co/GGzQNhFnwN
A new benchmark called ARC-AGI-2 has been introduced to evaluate the abstract reasoning capabilities of artificial intelligence systems. According to the research, humans solve 100% of the tasks in the benchmark, while leading AI models score less than 5%. ARC-AGI-2 features more demanding challenges: multi-rule, multi-step, and contextual reasoning tasks with larger grids and multiple interacting concepts, and its tasks are designed to be novel and non-reusable to prevent memorization. The paper, authored by researchers including Francois Chollet, Mike Knoop, Greg Kamradt, and Henry Pinkard, highlights a fundamental gap between human and artificial intelligence and underscores that current frontier AI models have not achieved artificial general intelligence (AGI). The claim that the tasks are easy for humans but difficult for AI is substantiated by controlled testing with over 400 human participants. The benchmark also reflects a broader shift in AI evaluation, away from a sole focus on algorithms and toward ever more complex and diverse evaluations and reinforcement learning environments.
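For readers unfamiliar with how "percent of tasks solved" numbers like the ones above are produced, here is a minimal sketch assuming the publicly documented ARC-AGI task format (JSON files containing "train" demonstration pairs and "test" pairs of integer grids) and a solved-within-two-attempts rule; the directory path, the helper names, and the baseline predictor are illustrative assumptions, not code from the paper.

```python
import json
from pathlib import Path

# Illustrative sketch of ARC-style scoring (assumed: the public ARC-AGI
# JSON task format and a "solved within two attempts" rule).
# Each task file holds "train" demonstration pairs and "test" pairs,
# where every grid is a list of rows of integers 0-9 (colors).

def load_task(path: Path) -> dict:
    """Load one ARC-style task from a JSON file."""
    with open(path) as f:
        return json.load(f)

def grids_equal(a: list[list[int]], b: list[list[int]]) -> bool:
    """Exact match: same shape and same cell values."""
    return a == b

def task_solved(task: dict, predict) -> bool:
    """A task counts as solved only if every test output is matched
    by one of (at most) two predicted attempts."""
    for pair in task["test"]:
        attempts = predict(task["train"], pair["input"])[:2]
        if not any(grids_equal(att, pair["output"]) for att in attempts):
            return False
    return True

def score(task_dir: Path, predict) -> float:
    """Fraction of tasks solved -- the kind of headline number reported
    for humans (~100%) versus frontier models (<5%)."""
    paths = sorted(task_dir.glob("*.json"))
    solved = sum(task_solved(load_task(p), predict) for p in paths)
    return solved / len(paths) if paths else 0.0

if __name__ == "__main__":
    # Hypothetical baseline: always guess the unchanged input grid, twice.
    identity = lambda train, test_input: [test_input, test_input]
    print(score(Path("arc-agi-2/evaluation"), identity))
```

Exact-match grading over held-out test grids is what makes the human-versus-model gap easy to state in a single percentage; everything else here (paths, the two-attempt cap, the baseline predictor) is a placeholder for the evaluation harness described in the paper.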