On August 14, OpenAI released SWE-bench Verified, a human-validated subset of the SWE-bench benchmark designed to assess AI models' autonomous software engineering capabilities more accurately. The release addresses concerns that the original SWE-bench may have systematically underestimated those capabilities. SWE-bench has become a community standard for evaluating models as software engineering agents, a role highlighted by OpenAI's Preparedness team, led by Neil Chowdhury, in announcing the release. The announcement has also fueled speculation that OpenAI will soon launch a new model highly capable in coding and math, a competitor to Sonnet 3.5.
Just so we’re all clear on what happened > Aug 14 - OpenAI releases SWE-bench-verified > because they suspected “SWE-bench systematically underestimating models’ autonomous software engineering capabilities” This means there is internal expectation that they will saturate… https://t.co/8n90NycnEY
Our Preparedness team evaluates frontier models’ abilities as software engineering agents, a prerequisite skill that could one day enable models to operate autonomously and self-improve. SWE-bench has become the community standard for evaluating models on software engineering,… https://t.co/5i16Gr0jAQ
This release of SWE-bench Verified feels to me like OpenAI will launch a new model that is highly capable in coding & math, a competitor to Sonnet 3.5. Only time will tell if 🍓/ Q* + larger model works like we think it will.