
The ARC-AGI benchmark, which offers a $1 million prize, has garnered significant attention recently as a problem that large language models (LLMs) find very hard. Ryan Greenblatt achieved a state-of-the-art (SOTA) result, reaching 71% accuracy on a set of examples where humans typically score 85%. His approach uses a carefully crafted few-shot prompt to have an LLM generate many candidate Python programs implementing each transformation, producing ~5k guesses per problem, then selects the best candidates by checking them against the training examples, with an additional debugging step. Some experts argue that solving ARC-AGI would not equate to achieving artificial general intelligence (AGI), but they recognize it as a valuable challenge that highlights LLMs' weakness on cell-based rules such as the Game of Life. Evaluated with GPT-4o on the public set, the approach reached 50% accuracy, demonstrating progress through clever tricks around existing models and increased search compute.
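The selection step of this generate-and-check approach can be sketched as follows. This is a minimal illustration, not the actual pipeline: the candidate programs below are hardcoded stand-ins for LLM-generated ones, whereas the real approach samples thousands of candidates from a few-shot prompt and adds a debugging pass.

```python
# Sketch of the candidate-selection step: score each generated program
# against the training examples and keep the ones that fit best.
# (Candidates are hardcoded stand-ins for LLM-generated programs.)

def score(program, examples):
    """Count how many training pairs the program transforms correctly."""
    hits = 0
    for grid_in, grid_out in examples:
        try:
            if program(grid_in) == grid_out:
                hits += 1
        except Exception:
            pass  # buggy candidates simply score zero on that pair
    return hits

def select_best(candidates, examples, k=2):
    """Rank candidates by training accuracy; keep the top k as guesses."""
    return sorted(candidates, key=lambda p: score(p, examples), reverse=True)[:k]

# Toy task: the hidden rule is "transpose the grid".
examples = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[0, 5], [6, 0]], [[0, 6], [5, 0]]),
]
candidates = [
    lambda g: g,                           # identity
    lambda g: [list(r) for r in zip(*g)],  # transpose
    lambda g: [row[::-1] for row in g],    # mirror each row
]
best = select_best(candidates, examples, k=1)[0]
print(best([[7, 8], [9, 1]]))  # the transpose candidate wins: [[7, 9], [8, 1]]
```

Selecting by training-example fit is what makes generating thousands of guesses viable: most candidates are wrong or crash, but verification against the examples is cheap and reliable.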

Progress on $1M ARC-AGI benchmark that is very hard for LLMs by carefully-crafted few-shot prompt to generate many possible Python programs to implement the transformations, generating ~5k guesses, selecting the best ones using the examples, and a debugging step. https://t.co/jCfuY1fsps
50% on ARC-AGI with GPT-4o
This wonderful blog post brings out another point that I didn't explicitly mention in my blog -- ARC-AGI gets solved with a bunch of very clever tricks around existing models, and more search compute. https://t.co/YvoT4PC3yz https://t.co/CeXqixsbSF