Gemini 2.5 Pro is doing well on Pokémon, but it isn't a fair comparison to Claude! The Gemini run gets fully labeled minimaps, a helper path‑finder, and live prompt tweaks. The Claude run can only see the immediate screen, though it does get some navigational assistance. https://t.co/flhnC2lo4t
>Gemini plays Pokemon has gotten its 4th badge now, it escaped the looping hell >Claude is in Mt Moon again let's see how @ChatGPTapp does, especially with its powerful image scanning https://t.co/VI1idHzV3a
"Our new Ash Ketchum: Gemini 2.5 Pro ✌️" It is basically beating in the Pokémon benchmark especially Claude 3.7! 💪 https://t.co/h6QMoxAhWV https://t.co/1em1Q4D4LX
Gemini 2.5 Pro, paired with SimularAI's agent framework, has set a new state-of-the-art (SOTA) on the OSWorld benchmark, achieving a 41.4% success rate within 50 steps. This surpasses the previous best, Agent S combined with Claude 3.7, at 34.5%, and also outperforms agents built on OpenAI and Anthropic models, which scored 32.6% and 26%, respectively. The model has also demonstrated strong capabilities on the Pokémon benchmark, earning its fifth badge, ahead of other models that have achieved at most three. However, some note that the comparison with Claude 3.7 is not entirely equitable: Gemini 2.5 Pro benefits from fully labeled minimaps, a helper path-finder, and live prompt adjustments, whereas Claude 3.7 operates with only immediate screen visibility and limited navigational support. Gemini 2.5 Pro's advances were highlighted during talks and a hackathon at the AGI House event, with demonstrations planned on platforms such as Google AI Studio. The model is regarded as pushing the boundaries of AI agent performance and is gaining recognition as a leading foundation model in this space.