Recent advances in AI software development agents have produced significant gains on the SWE-bench benchmarks. OpenHands CodeAct 2.1, developed by All Hands AI, has achieved a state-of-the-art resolve rate of 53% on SWE-Bench Verified and 41.7% on SWE-Bench Lite, surpassing the previous record of 49% set by Anthropic with Claude 3.5 Sonnet. The improvements in OpenHands CodeAct 2.1 are attributed to its use of function calling and its integration of Claude 3.5 Sonnet. By comparison, OpenAI's o1-preview and GPT-4o recorded resolve rates of 38.4% and 33.2%, respectively. The rapid progress of AI coding agents highlights how quickly this field is evolving, with SWE-bench serving as the benchmark for evaluating how effectively they resolve real GitHub issues.
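Here, "function calling" means the model returns structured tool invocations that the agent can execute directly, rather than free-form text the agent must parse. The sketch below is only an illustration of that pattern using Anthropic's tool-use API, not OpenHands' actual implementation; the tool name, schema, and prompt are assumptions made for the example.

# Minimal sketch (illustrative, not OpenHands' code): registering a
# hypothetical file-editing tool with Claude 3.5 Sonnet via tool use.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

edit_tool = {
    "name": "edit_file",  # hypothetical tool name for this example
    "description": "Replace a snippet of text in a repository file.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path to the file to edit"},
            "old_text": {"type": "string", "description": "Exact text to replace"},
            "new_text": {"type": "string", "description": "Replacement text"},
        },
        "required": ["path", "old_text", "new_text"],
    },
}

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[edit_tool],
    messages=[{"role": "user", "content": "Fix the off-by-one error in utils/pagination.py"}],
)

# If the model decides to call the tool, the response contains a structured
# tool_use block with machine-readable arguments instead of free-form text.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)

With structured arguments like these, an agent can apply the edit, send the outcome back as a tool result, and continue the loop until the issue is resolved.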
All Hands AI Open Sources OpenHands CodeAct 2.1: A New Software Development Agent, the First to Solve Over 50% of Real GitHub Issues in SWE-Bench https://t.co/JvnN2Uy4bD
Open source AllHands + Claude 3.5 Sonnet is now #1 on SWE-Bench Verified with 53%!
Anthropic had posted 49%
GPT o1-preview was 38.4%
4-o was 33.2%
Devin launched 6mo ago at 13.86% on SWE-Bench (~25% on Verified)
Before that, sota was 2%
Progress in AI coding agents is so fast. https://t.co/LP3amraMbS
Best software development AI agent!? OpenHands CodeAct 2.1 achieves state-of-the-art results:
🥇 53% resolve rate on SWE-Bench Verified
🥇 41.7% resolve rate on SWE-Bench Lite
Improvements thanks to function calling, use of Anthropic's Claude 3.5 model, and optimizing… https://t.co/wi8XSyT9PB