Recent evaluations of OpenAI's GPT-4.1 show mixed results against competing models. On coding tasks, GPT-4.1 reportedly beats Gemini 2.5 Pro and nearly matches Claude 3.7 Sonnet. On the AIME benchmark, however, it trails DeepSeek V3 by more than 10 percentage points while costing roughly eight times as much. In the latest LiveBench results, GPT-4.1 performs on par with Sonnet 3.7 and DeepSeek V3, a somewhat positive outcome. Other evaluations show weak performance on the ARC-AGI benchmarks: GPT-4.1 scores 5.5% on ARC-AGI-1 and 0.0% on ARC-AGI-2, with GPT-4.1 mini at 3.5% and 0.0% on the same benchmarks. On IFEval it reaches 87.4%, slightly below Sonnet 3.5's 90.2%. Taken together, these results suggest that while GPT-4.1 is fast and cost-effective, its benchmark performance may not fully meet the expectations set by its predecessors and competitors.
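To make the scattered percentages easier to compare, here is a minimal Python sketch that tabulates the scores quoted above and prints each benchmark gap. The numbers are the ones reported in this digest; the dictionary layout and the `print_gaps` helper are illustrative assumptions, not part of any cited evaluation harness.

```python
# Benchmark scores (%) as quoted in the digest above; not independently verified.
REPORTED_SCORES = {
    "ARC-AGI-1": {"GPT-4.1": 5.5, "GPT-4.1 mini": 3.5},
    "ARC-AGI-2": {"GPT-4.1": 0.0, "GPT-4.1 mini": 0.0},
    "IFEval":    {"GPT-4.1": 87.4, "Sonnet 3.5": 90.2},
}

def print_gaps(scores: dict[str, dict[str, float]], model: str = "GPT-4.1") -> None:
    """Print `model`'s score on each benchmark and its gap to the best listed model."""
    for bench, results in scores.items():
        best_model, best = max(results.items(), key=lambda kv: kv[1])
        own = results.get(model)
        if own is None:
            continue
        print(f"{bench}: {model} {own:.1f}% "
              f"(best listed: {best_model} {best:.1f}%, gap {best - own:+.1f} pts)")

if __name__ == "__main__":
    print_gaps(REPORTED_SCORES)
```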
GPT-4.1 almost tops Claude 3.7 on coding?! New eval dropping using our #1 SWE-bench coding agent! - GPT-4.1 beats Gemini 2.5 Pro and almost tops Claude 3.7 Sonnet! - Even GPT-4.1 mini matches Claude 3.5 Sonnet V2 performance. It was the https://t.co/6l1vigcsio
LiveBench results are out for GPT-4.1: on par with Sonnet 3.7 and DeepSeekV3 (first somewhat positive result I see) https://t.co/p6ib5d7Tij
IFEval results: GPT-4.1: 87.4%, Sonnet 3.5: 90.2%