The AI research organization ARC has released a public preview of ARC-AGI-3, an interactive reasoning benchmark designed to challenge advanced AI systems. This version comprises six novel games, three of which are available in the initial release. The games test interactive reasoning abilities such as adaptive world modeling and agent-based problem solving, in contrast to earlier benchmarks that emphasized static reasoning well suited to deep learning. Current frontier models, including leading large language models like OpenAI's o3, score 0% on these tasks, while humans consistently reach 100%. The benchmark aims to provide a more rigorous measure of progress toward artificial general intelligence (AGI) by testing an AI's ability to generalize in novel environments. The ARC team stresses the need for rigorous, honest benchmarking, noting that many AI benchmark results have suffered from saturation, contamination, and disputed answer keys. ARC-AGI-3 is part of an ongoing effort to push AI development beyond existing capabilities and to better assess reasoning in dynamic, interactive contexts.
(Since I am on a benchmark theme today) The ARC team does well at keeping AI labs honest about their benchmarks, including showing that Qwen's big ARC-AGI performance doesn't replicate. But ARC-AGI also has a strong philosophy of what AI should do. We need other benchmarking efforts https://t.co/e7q9f3ZRAC
Reinforcement learning is powerful, but not always practical. That’s why the new open-source framework ART caught our eye. It makes RL usable for LLM agents, and in this walkthrough, you’ll see how it trains a small open model to beat GPT-4o-mini at Tic-Tac-Toe. https://t.co/nMApPX6Vs5
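As a rough intuition for the kind of loop a framework like ART automates, here is a minimal tabular Q-learning sketch that learns Tic-Tac-Toe against a random opponent. This is not ART's actual API (ART applies RL to LLM agents via rollouts and rewards rather than a lookup table); the environment, reward values, and hyperparameters below are illustrative assumptions.

```python
# Illustrative sketch only: tabular Q-learning for Tic-Tac-Toe vs. a random opponent.
# Not ART's API; just the reward-driven training loop idea in miniature.
import random
from collections import defaultdict

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]

def place(board, idx, mark):
    new = list(board)
    new[idx] = mark
    return tuple(new)

Q = defaultdict(float)                 # Q[(state, action)] -> value estimate
ALPHA, GAMMA, EPSILON = 0.3, 0.9, 0.1  # learning rate, discount, exploration

def choose(board):
    moves = legal_moves(board)
    if random.random() < EPSILON:
        return random.choice(moves)                      # explore
    return max(moves, key=lambda m: Q[(board, m)])       # exploit

def train(episodes=50_000):
    for _ in range(episodes):
        board = (" ",) * 9
        while True:
            move = choose(board)
            after_agent = place(board, move, "X")
            if winner(after_agent) == "X":               # agent wins: reward +1
                Q[(board, move)] += ALPHA * (1.0 - Q[(board, move)])
                break
            if not legal_moves(after_agent):             # draw: reward 0
                Q[(board, move)] += ALPHA * (0.0 - Q[(board, move)])
                break
            # Random opponent replies with "O".
            after_opp = place(after_agent, random.choice(legal_moves(after_agent)), "O")
            if winner(after_opp) == "O":                 # agent loses: reward -1
                Q[(board, move)] += ALPHA * (-1.0 - Q[(board, move)])
                break
            if not legal_moves(after_opp):               # draw: reward 0
                Q[(board, move)] += ALPHA * (0.0 - Q[(board, move)])
                break
            # Non-terminal step: bootstrap from the best action in the next state.
            best_next = max(Q[(after_opp, m)] for m in legal_moves(after_opp))
            Q[(board, move)] += ALPHA * (GAMMA * best_next - Q[(board, move)])
            board = after_opp

train()
```

After enough episodes the learned policy should rarely lose to a random opponent. The same reward-driven idea, scaled from a lookup table to LLM rollouts, is what frameworks like ART are meant to make practical.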
The mitigating factor for the problems with AI benchmarks (errors, saturation, contamination) is that, despite those issues, they are all still fairly heavily correlated. So if your AI does well on GPQA or MMLU or HLE, it also tends to do well on other benchmarks & on vibes & real work.