OpenAI's o3 model has reached a notable milestone, scoring 25% on FrontierMath, a benchmark designed to test advanced mathematical problem-solving. The benchmark is extremely difficult: reports indicate that GPT-4 solves less than 2% of its problems. The problems were crafted by more than 60 professional mathematicians, including math professors, and are deliberately kept out of training datasets. In response to o3's result, a new suite of problems called Tier 4 has been announced, intended to exceed the difficulty of the existing set. Separately, the FineMath dataset has been released as an open resource for training AI models in mathematical reasoning.
AI is getting better at math, but we're just scratching the surface of what it will be capable of IMO (e.g., o3 only got 25% on FrontierMath). So we're super excited to release FineMath, the best open math dataset for everyone to use. Currently number one trending datasets… https://t.co/yW4Q7E2zEV
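For anyone who wants to try FineMath, here is a minimal sketch of loading it with the Hugging Face `datasets` library. The repository ID `HuggingFaceTB/finemath`, the config name `finemath-4plus`, and the `text` field are assumptions for illustration, not details confirmed in the tweet.

```python
# Sketch: stream a few FineMath examples without downloading the full corpus.
# The dataset ID, config name, and field name below are assumptions.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceTB/finemath",  # assumed repository ID
    "finemath-4plus",          # assumed config name
    split="train",
    streaming=True,            # iterate lazily instead of fetching all shards
)

for example in ds.take(3):        # peek at the first three records
    print(example["text"][:200])  # assumed text field
```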
FrontierMath, the newly released math benchmark, is so hard that even GPT-4 solves less than 2% of its problems. 60+ professional mathematicians, including math professors, crafted these problems, and they are not in any training data. OpenAI o3 did 25% on THIS. 🔧 FrontierMath details: →… https://t.co/1uWlsNFzRm
Not trying to move the goalposts here, but if the model was a step towards AGI, “General” being the key word here, I would be curious how much o3 would have gotten on the ARC-AGI eval without being trained on its training set. A few-shot prompt would be fine. Gary is mostly…
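For context on what "a few-shot prompt would be fine" means in practice, here is a hypothetical sketch of that evaluation setup: the model sees a handful of input/output grid pairs from one task in its prompt and must produce the test output, with no training on the ARC-AGI training set. Only the JSON task structure follows the public ARC format; the grid rendering and file name are illustrative.

```python
# Hypothetical sketch of a few-shot ARC-style prompt: show the solved
# training pairs from one task, then ask for the held-out test output.
import json

def format_grid(grid):
    """Render a grid of ints as space-separated rows."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_few_shot_prompt(task):
    """Build a plain-text prompt from one ARC task dict
    (public ARC JSON format: {"train": [...], "test": [...]})."""
    parts = ["Infer the transformation from the examples, then solve the test input."]
    for i, pair in enumerate(task["train"], 1):
        parts.append(f"Example {i} input:\n{format_grid(pair['input'])}")
        parts.append(f"Example {i} output:\n{format_grid(pair['output'])}")
    parts.append(f"Test input:\n{format_grid(task['test'][0]['input'])}")
    parts.append("Test output:")
    return "\n\n".join(parts)

with open("arc_task.json") as f:  # illustrative file name: any ARC task file
    print(build_few_shot_prompt(json.load(f)))
```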