🚨 ARC-AGI Reality Check: o3 (medium): 53% on ARC-AGI-1. o3 (released): 3% on ARC-AGI-2. Proof that AGI hype is still running ahead of actual capability. The gap between guided problem-solving and general reasoning is real! #AI #AGI #MachineLearning #ArtificialIntelligence https://t.co/SUucJzU8Bq
Sometimes I find memory to be very useful; other times I find it unhelpful. It would be good if we could toggle it on or off for a given conversation. https://t.co/XwOKRLDS83
o3 is on the faster time-horizon trend that started in 2024, suggesting AI is moving ~2x faster than before, and possibly accelerating. https://t.co/3N0gsXnify https://t.co/Nvn2dUIsA0
OpenAI's o3-medium model has demonstrated leading performance on the ARC-AGI-1 benchmark, scoring 53% to 57% depending on the evaluation, at a cost of $1.50 per task. This is approximately double the accuracy of other chain-of-thought reasoning systems while being roughly 20 times more cost-efficient. The o4-mini model also showed competitive efficiency on ARC-AGI-1, with scores ranging from 21% to 42% depending on the variant. However, both models performed poorly on the more challenging ARC-AGI-2 benchmark, scoring below 3%, which highlights the current limits of general reasoning capabilities. ARC-AGI-1 continues to provide nuanced insight into AI reasoning models, while ARC-AGI-2 remains largely unsolved.

Users and analysts have praised o3 for its advanced reasoning and search capabilities, describing it as a significant step forward in AI performance, capable of performing tasks akin to a junior analyst but much faster. Despite these strengths, the models sometimes fail to produce any output at higher reasoning settings. The results underscore the gap between guided problem-solving and artificial general intelligence (AGI), suggesting that AGI remains an aspirational goal.
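As a rough illustration of what "double the accuracy, ~20x more cost-efficient" means in practice, the sketch below expresses cost-efficiency as accuracy per dollar. Only o3-medium's score and per-task cost come from the figures above; the baseline chain-of-thought system's score and cost are hypothetical, chosen solely so the claimed ratios hold.

```python
# Illustrative sketch, not an official evaluation script: expressing the
# "double the accuracy, ~20x the cost-efficiency" claim as accuracy per dollar.
# o3-medium's figures come from the summary above; the baseline chain-of-thought
# system's score and cost are hypothetical, chosen so the claimed ratios hold.

models = {
    "o3-medium":           {"score": 0.53, "cost_per_task": 1.50},   # from the text
    "baseline CoT system": {"score": 0.27, "cost_per_task": 15.00},  # assumed for illustration
}

baseline = models["baseline CoT system"]
baseline_eff = baseline["score"] / baseline["cost_per_task"]

for name, m in models.items():
    efficiency = m["score"] / m["cost_per_task"]  # accuracy points per dollar
    relative = efficiency / baseline_eff          # cost-efficiency vs. baseline
    print(f"{name}: {m['score']:.0%} at ${m['cost_per_task']:.2f}/task "
          f"-> {efficiency:.3f} acc/$ ({relative:.1f}x baseline)")
```

Under these assumed baseline numbers, o3-medium comes out at roughly 2x the accuracy and about 20x the accuracy-per-dollar, matching the ratios reported above.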