Recent evaluations of OpenAI's GPT-4.1 show mixed results against competing models. On coding tasks, GPT-4.1 reportedly beats Gemini 2.5 Pro and nearly matches Claude 3.7 Sonnet. On the AIME benchmark, however, it trails DeepSeek V3 by more than 10 percentage points while costing roughly eight times as much. In the latest LiveBench results, GPT-4.1 performs on par with Sonnet 3.7 and DeepSeek V3, a somewhat positive outcome. Other evaluations show weak performance on the ARC-AGI benchmarks: GPT-4.1 scores 5.5% on ARC-AGI-1 and 0.0% on ARC-AGI-2, with GPT-4.1 mini at 3.5% and 0.0% on the same benchmarks. On IFEval it reaches 87.4%, slightly below Sonnet 3.5's 90.2%. Taken together, these results suggest that while GPT-4.1 is fast and cost-effective, its benchmark performance may not fully meet the expectations set by its predecessors and competitors.
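To make the scattered percentages easier to compare, here is a minimal Python sketch that tabulates the scores quoted above and prints each benchmark gap. The numbers are the ones reported in this digest; the dictionary layout and the `print_gaps` helper are illustrative assumptions, not part of any cited evaluation harness.

```python
# Benchmark scores (%) as quoted in the digest above; not independently verified.
REPORTED_SCORES = {
    "ARC-AGI-1": {"GPT-4.1": 5.5, "GPT-4.1 mini": 3.5},
    "ARC-AGI-2": {"GPT-4.1": 0.0, "GPT-4.1 mini": 0.0},
    "IFEval":    {"GPT-4.1": 87.4, "Sonnet 3.5": 90.2},
}

def print_gaps(scores: dict[str, dict[str, float]], model: str = "GPT-4.1") -> None:
    """Print `model`'s score on each benchmark and its gap to the best listed model."""
    for bench, results in scores.items():
        best_model, best = max(results.items(), key=lambda kv: kv[1])
        own = results.get(model)
        if own is None:
            continue
        print(f"{bench}: {model} {own:.1f}% "
              f"(best listed: {best_model} {best:.1f}%, gap {best - own:+.1f} pts)")

if __name__ == "__main__":
    print_gaps(REPORTED_SCORES)
```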
GPT-4.1 almost tops Claude 3.7 on coding?! New eval dropping using our #1 SWE-bench coding agent! - GPT-4.1 beats Gemini 2.5 Pro and almost tops Claude 3.7 Sonnet! - Even GPT-4.1 mini matches Claude 3.5 Sonnet V2 performance. It was the https://t.co/6l1vigcsio
LiveBench results are out for GPT-4.1: on par with Sonnet 3.7 and DeepSeekV3 (first somewhat positive result I see) https://t.co/p6ib5d7Tij
IFEval results: GPT-4.1: 87.4%, Sonnet 3.5: 90.2%