🚨 Grok 4 and Grok 4 Code are going to create an AI storm. Truly SOTA benchmarks out now https://t.co/AdCsTyx8pd
Some leaked Grok-4 and Grok-4 Code results: 35% on HLE (45% with reasoning!!); 87-88% on GPQA; 72-75% on SWE Bench (Grok 4 Code) https://t.co/G1J29fI4GT https://t.co/1uN1QYYlHN
BREAKING: Grok 4 benchmarks leaked. Impressive numbers: 🔹 grok-4-0629 (Standard: HLE 35, GPQA 87, AIME’25 95; Test Time: HLE 45, GPQA 88, AIME’25 —) 🔹 grok-4-code-0629 (Standard: SWEBench 72, Terminal Bench —; Test Time: SWEBench 75, Terminal Bench) https://t.co/QCmietOFI2
Pre-release benchmark results purportedly showing the performance of Grok-4 and Grok-4 Code, the next-generation AI models in xAI's Grok series, have surfaced online. According to the leaked figures, Grok-4 (listed as grok-4-0629) scores 35% on the Humanity's Last Exam (HLE) benchmark in standard mode and 45% in the "Test Time" setting, which one post describes as "with reasoning". The same leak places Grok-4 at 87% on GPQA in standard mode and 88% in the Test Time setting, and at 95% on the AIME’25 math competition test in standard mode; no Test Time score is given for AIME’25. Grok-4 Code (listed as grok-4-code-0629) records 72% on the software-engineering SWE-bench suite in standard mode and 75% in the Test Time setting. Terminal Bench is listed for Grok-4 Code, but no scores are disclosed. If accurate, the numbers would represent state-of-the-art results across several widely followed evaluation sets, signalling a potential jump in reasoning and coding capabilities ahead of any formal announcement. The benchmarks remain unverified, and xAI has yet to comment.