Really excited to share our first paper: "On the Evaluation of Engineering Artificial General Intelligence" --> a paradigm shift has occurred in the AI field over the past year: the focus has shifted from algorithms to ever more complex and diverse evals and RL environments --> https://t.co/WFXHXI1NZw
ARC-AGI-2 paper is out! Here are the new principles of the challenge: • Requires multi-rule, multi-step, and contextual reasoning. • Grids are larger, contain more objects, and encode multiple interacting concepts. • Tasks are novel and not reusable to limit memorization. https://t.co/3X1FBOdvzc
ARC v2 paper is officially out! We tested v2 in a controlled setting with over 400 humans; this report contains details and analysis to substantiate our claim that v2 is relatively "easy for humans, hard for AI". We'll be releasing the raw data later this week. https://t.co/GGzQNhFnwN
A new benchmark called ARC-AGI-2 has been introduced to evaluate the abstract reasoning capabilities of artificial intelligence systems. According to the research, humans solve 100% of the tasks in the benchmark, while leading AI models score less than 5%. ARC-AGI-2 features more demanding challenges: multi-rule, multi-step, and contextual reasoning tasks with larger grids and multiple interacting concepts, and its tasks are designed to be novel and non-reusable to prevent memorization. The paper, authored by researchers including Francois Chollet, Mike Knoop, Greg Kamradt, and Henry Pinkard, highlights a fundamental gap between human and artificial intelligence and underscores that current frontier AI models have not achieved artificial general intelligence (AGI). The claim that the tasks are easy for humans but difficult for AI is substantiated by controlled testing with over 400 human participants. The benchmark also reflects a broader shift in AI evaluation, away from a sole focus on algorithms and toward ever more complex and diverse evaluations and reinforcement learning environments.
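For readers unfamiliar with how "percent of tasks solved" numbers like the ones above are produced, here is a minimal sketch assuming the publicly documented ARC-AGI task format (JSON files containing "train" demonstration pairs and "test" pairs of integer grids) and a solved-within-two-attempts rule; the directory path, the helper names, and the baseline predictor are illustrative assumptions, not code from the paper.

```python
import json
from pathlib import Path

# Illustrative sketch of ARC-style scoring (assumed: the public ARC-AGI
# JSON task format and a "solved within two attempts" rule).
# Each task file holds "train" demonstration pairs and "test" pairs,
# where every grid is a list of rows of integers 0-9 (colors).

def load_task(path: Path) -> dict:
    """Load one ARC-style task from a JSON file."""
    with open(path) as f:
        return json.load(f)

def grids_equal(a: list[list[int]], b: list[list[int]]) -> bool:
    """Exact match: same shape and same cell values."""
    return a == b

def task_solved(task: dict, predict) -> bool:
    """A task counts as solved only if every test output is matched
    by one of (at most) two predicted attempts."""
    for pair in task["test"]:
        attempts = predict(task["train"], pair["input"])[:2]
        if not any(grids_equal(att, pair["output"]) for att in attempts):
            return False
    return True

def score(task_dir: Path, predict) -> float:
    """Fraction of tasks solved -- the kind of headline number reported
    for humans (~100%) versus frontier models (<5%)."""
    paths = sorted(task_dir.glob("*.json"))
    solved = sum(task_solved(load_task(p), predict) for p in paths)
    return solved / len(paths) if paths else 0.0

if __name__ == "__main__":
    # Hypothetical baseline: always guess the unchanged input grid, twice.
    identity = lambda train, test_input: [test_input, test_input]
    print(score(Path("arc-agi-2/evaluation"), identity))
```

Exact-match grading over held-out test grids is what makes the human-versus-model gap easy to state in a single percentage; everything else here (paths, the two-attempt cap, the baseline predictor) is a placeholder for the evaluation harness described in the paper.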