Sources
Ate-a-Pi: Just so we’re all clear on what happened > Aug 14 - OpenAI releases SWE-bench-verified > because they suspected “SWE-bench systematically underestimating models’ autonomous software engineering capabilities” This means there is internal expectation that they will saturate… https://t.co/8n90NycnEY
Neil Chowdhury: Our Preparedness team evaluates frontier models’ abilities as software engineering agents, a prerequisite skill that could one day enable models to operate autonomously and self-improve. SWE-bench has become the community standard for evaluating models on software engineering,… https://t.co/5i16Gr0jAQ
Casper Hansen: This release of SWE-bench Verified feels to me like OpenAI will launch a new model that is highly capable in coding & math, a competitor to Sonnet 3.5. Only time can tell if 🍓/ Q* + larger model works like we think it will.

