o4-mini, o3, and Gemini 2.5 Flash Preview all perform well on the Thematic Generalization Benchmark. However, Claude 3.7 Sonnet Thinking 16K and Gemini 2.5 Pro remain at the top. https://t.co/wBRF0nG97T
o3 (high reasoning) and o4-mini (high reasoning) take 2nd and 3rd place on the Extended NYT Connections Benchmark, slightly behind o1-pro. Gemini 2.5 Flash Preview (16K) improves upon Gemini 2.0 Flash Thinking Experimental: 23.1 → 25.8. https://t.co/Lb996DX2NO
On the new OpenAI long-context benchmark, Gemini 2.5 Pro scores better at 1M tokens than o4-mini does at 130k. $GOOG https://t.co/MZfiWgqr14
The latest evaluations of AI coding models show Gemini 2.5 Pro still leading OpenAI's o3 and o4-mini on both coding performance and cost efficiency. o3 scored 6.7% higher than Gemini 2.5 Pro on some benchmarks but is roughly 17 times more expensive, while o4-mini scored lower than Gemini 2.5 Pro despite costing about three times as much. Gemini 2.5 Pro also achieves a higher first-try pass rate on coding tasks and is surpassed by o3 only after multiple attempts.

Beyond raw scores, Gemini 2.5 Pro outperforms o4-mini on long-context benchmarks, handling up to 1 million tokens versus o4-mini's 130,000. In coding style, Gemini 2.5 Pro respects the original code structure, whereas o4-mini tends to rewrite existing code. o3 and o4-mini excel at refining specifications and building new features, with o3 offering strong improvement suggestions.

On broader benchmarks such as Extended NYT Connections and Thematic Generalization, o3 and o4-mini rank well but remain slightly behind models like o1-pro and Claude 3.7 Sonnet Thinking 16K. Overall, Gemini 2.5 Pro's balance of strong coding ability, cost-effectiveness, and long-context handling currently positions it ahead of OpenAI's comparable models.
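To make the cost/quality trade-off concrete, here is a minimal sketch of the arithmetic behind such comparisons. The relative cost multipliers (17x for o3, 3x for o4-mini) and o3's 6.7% score edge come from the summary above; the baseline score, baseline price, and o4-mini's score gap are hypothetical placeholders, not published figures.

```python
# Sketch of a cost-adjusted benchmark comparison.
# Cost multipliers (17x, 3x) and o3's +6.7% score are from the summary above;
# BASELINE_SCORE, BASELINE_COST, and o4-mini's score gap are placeholders.

BASELINE_SCORE = 70.0  # hypothetical Gemini 2.5 Pro benchmark score
BASELINE_COST = 1.0    # hypothetical cost per run, in arbitrary units

models = {
    # name: (score, cost per run)
    "Gemini 2.5 Pro": (BASELINE_SCORE, BASELINE_COST),
    "o3": (BASELINE_SCORE * 1.067, BASELINE_COST * 17),     # +6.7% score, ~17x cost
    "o4-mini": (BASELINE_SCORE * 0.95, BASELINE_COST * 3),  # lower score (placeholder), ~3x cost
}

# Rank models by score per unit of cost.
for name, (score, cost) in sorted(
    models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
):
    print(f"{name:16s} score={score:6.1f} cost={cost:5.1f} score/cost={score / cost:6.2f}")
```

Under any baseline you plug in, o3's 6.7% score edge is dwarfed by its 17x cost multiplier in the score-per-cost ratio, which is the point the summary makes.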