Jul 4, 04:30 PM

Leaked Benchmarks Hint Grok-4 Tops Key AI Reasoning Tests

Pre-release benchmark results purportedly showing the performance of the next-generation artificial-intelligence models Grok-4 and Grok-4 Code have surfaced online. According to the leaked figures, Grok-4 scores 35% on the Humanity's Last Exam (HLE) benchmark in standard mode and 45% when chain-of-thought reasoning is enabled. The same leak places Grok-4 at 87-88% on GPQA and 95% on the AIME’25 math competition test, while Grok-4 Code records 72-75% on the software-engineering benchmark SWE-bench. Scores for Terminal-Bench were referenced but not disclosed. If accurate, the numbers would be state-of-the-art results on several widely followed evaluation sets, signalling a potential jump in reasoning and coding capabilities ahead of any formal announcement. The benchmarks remain unverified, and the developer of the Grok series has yet to comment.

#Human Last Exam #SWEBench #Terminal Bench #Grok

Written with ChatGPT.
