I agree that Grok, like many large language models, is just a more sophisticated form of "garbage in, garbage out." If AI is trained on good data, you can get a good answer, but AI can be very slanted in non-quantifiable fields such as sociology and economics. So below… https://t.co/ccpsbL0OKf
Did xAI lie about Grok 3’s benchmarks? The article covers the dispute between xAI and OpenAI over the reported benchmarks of xAI's latest AI model, Grok 3. An OpenAI employee accused xAI of publishing misleading benchmark results, while xAI has defended its reporting.
Did xAI lie about Grok 3’s benchmarks?: https://t.co/CX1lW8oQml by TechCrunch #infosec #cybersecurity #technology #news
The performance of xAI's Grok 3 model has come under scrutiny following a comparison with OpenAI's o3-mini model. According to LiveBench scores, Grok 3 achieved an overall average score of 71.57%, with a coding task score of 67.38%. In contrast, OpenAI's o3-mini, dated January 31, 2025, scored 82.74% overall and 69.69% in coding tasks. This discrepancy has led to allegations of misleading benchmark reporting by xAI, with an OpenAI employee suggesting that xAI may have manipulated results. The controversy has sparked discussions about the reliability of AI models and their training data, with critics highlighting the potential for bias in AI outputs due to the sources of their training material.
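To put the quoted gap in perspective, here is a minimal sketch tallying the LiveBench numbers reported above; the scores come from the article, but the dictionary layout and labels are illustrative:

```python
# LiveBench scores as quoted in the article (percent).
scores = {
    "Grok 3": {"overall": 71.57, "coding": 67.38},
    "o3-mini (2025-01-31)": {"overall": 82.74, "coding": 69.69},
}

# Difference on each reported metric, o3-mini minus Grok 3.
for metric in ("overall", "coding"):
    gap = scores["o3-mini (2025-01-31)"][metric] - scores["Grok 3"][metric]
    print(f"{metric}: o3-mini leads by {gap:.2f} points")
```

Per these figures, the overall gap (about 11 points) is far wider than the coding gap (about 2 points), which is part of why the headline comparison drew scrutiny.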