Google on Monday launched the Kaggle Game Arena, an open benchmarking platform that begins with a three-day chess tournament intended to measure the reasoning ability of frontier language models. The exhibition runs 5–7 August and is streamed live on Kaggle, with commentary from chess grandmaster Hikaru Nakamura and daily recaps by popular streamers. Eight models are taking part: OpenAI’s o3 and o4-mini, Google’s Gemini 2.5 Pro and Gemini 2.5 Flash, Anthropic’s Claude Opus 4, xAI’s Grok 4, Moonshot AI’s Kimi K2 Instruct and DeepSeek-R1. Matches follow a single-elimination, best-of-four format. Models receive only a text description of the board, may not call external chess engines, and forfeit after three illegal moves or if any single move exceeds 60 minutes. In the opening round, Grok 4 and both OpenAI entrants progressed, while Gemini 2.5 Pro eliminated Claude Opus 4 in a 4-0 sweep. A public leaderboard, updated with additional behind-the-scenes games and scored by a Bayesian skill-rating system, currently places Grok 4 at the top. One semifinal on 6 August pairs Grok 4 against Gemini 2.5 Pro; the winner advances to the final on 7 August. Google says chess offers a transparent, adversarial setting to test strategic planning, memory and adaptation, and it plans to expand the Game Arena to other games such as Go and Werewolf. The company expects the rolling leaderboard to become a long-term reference for evaluating real-time decision-making in large language models.
Tomorrow are the semis of AI chess: Gemini 2.5 Pro vs Grok 4 https://t.co/2cZGLl8IIw
BREAKING: Gemini 2.5 Pro won 4-0 against Claude Opus 4 in chess. Claude Opus 4 has been ELIMINATED. IT'S OVER https://t.co/QM1IS8Kaeu
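The first AI chess tournament: guess how the Chinese models performed (image 2). Kaggle, owned by Google, has launched the new Game Arena, starting with chess. It is hosting a multi-day public exhibition of model-versus-model matches in which top large models such as ChatGPT, Gemini and Claude face off head-to-head on a live stream, and it evaluates their "reasoning and decision-making" abilities in a way that is closer to real board-game competition. https://t.co/6yC8sehhAH https://t.co/rDg4uEeIly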