Gemini 2.5 Pro is doing well on Pokémon, but it isn't a fair comparison to Claude! The Gemini run gets fully labeled minimaps, a helper path‑finder, and live prompt tweaks. The Claude run can only see the immediate screen, though it does get some navigational assistance. https://t.co/flhnC2lo4t
>Gemini plays Pokemon has gotten its 4th badge now, it escaped the looping hell >Claude is in Mt Moon again let's see how @ChatGPTapp does, especially with its powerful image scanning https://t.co/VI1idHzV3a
"Our new Ash Ketchum: Gemini 2.5 Pro ✌️" It is basically beating in the Pokémon benchmark especially Claude 3.7! 💪 https://t.co/h6QMoxAhWV https://t.co/1em1Q4D4LX
Gemini 2.5 Pro, paired with SimularAI's agent framework, has set a new state-of-the-art (SOTA) on the OSWorld benchmark, achieving a 41.4% success rate within 50 steps. This surpasses the previous best, Agent S combined with Claude 3.7, at 34.5%, and also outperforms agents built on OpenAI and Anthropic models, which scored 32.6% and 26%, respectively. The model has also demonstrated strong capabilities on the Pokémon benchmark, earning its fifth badge, ahead of other models that have achieved at most three. However, some note that the comparison with Claude 3.7 is not entirely equitable: Gemini 2.5 Pro benefits from fully labeled minimaps, a helper path-finder, and live prompt adjustments, whereas Claude 3.7 operates with only immediate screen visibility and limited navigational support. Gemini 2.5 Pro's advances were highlighted during talks and a hackathon at the AGI House event, with demonstrations planned on platforms such as Google AI Studio. The model is regarded as pushing the boundaries of AI agent performance and is gaining recognition as a leading foundation model in this space.