A new autonomous programming agent developed by @shawnup, built on OpenAI's o1 model, has achieved a state-of-the-art score of 64.6% on SWE-Bench Verified. The result, a significant advance for autonomous coding agents, was reached using the Weights & Biases Programmer and the Weave toolkit over many iterations, and it drew congratulations from across the AI community. Separately, DeepSeek-V3, an open large language model highlighted by DeepLearningAI, has outperformed notable models, including Llama 3.1 405B and GPT-4o, on benchmarks, excelling in particular at coding and math. The model uses a mixture-of-experts architecture with 671 billion total parameters, of which only 37 billion are active at any given time, and was trained at unusually low cost, signaling rapid progress in AI capabilities from China.
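The "671 billion parameters, 37 billion active" claim follows from how mixture-of-experts routing works: a gating network scores many expert sub-networks per token but runs only the top-k of them. The sketch below is a toy illustration of that idea with made-up sizes (8 experts, top-2 routing, tiny dimensions); it is not DeepSeek-V3's actual architecture, and all names and shapes here are assumptions for illustration.

```python
import math
import random

# Toy mixture-of-experts (MoE) routing sketch. Sizes are illustrative only;
# real MoE models use large transformer FFN experts, not single matrices.

random.seed(0)

N_EXPERTS = 8   # total experts (toy value)
TOP_K = 2       # experts actually run per token -> only a fraction is "active"
D = 4           # hidden dimension (toy value)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) / math.sqrt(cols) for _ in range(cols)]
            for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

experts = [rand_matrix(D, D) for _ in range(N_EXPERTS)]  # one tiny layer each
router = rand_matrix(N_EXPERTS, D)                        # gating weights

def moe_forward(x):
    """Route one token vector through only its top-k scoring experts."""
    logits = matvec(router, x)                     # score every expert
    top = sorted(range(N_EXPERTS), key=lambda i: logits[i])[-TOP_K:]
    exps = [math.exp(logits[i]) for i in top]
    weights = [e / sum(exps) for e in exps]        # softmax over chosen experts
    out = [0.0] * D
    for w, i in zip(weights, top):                 # the other experts never run
        y = matvec(experts[i], x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

token = [random.gauss(0, 1) for _ in range(D)]
print(len(moe_forward(token)))  # 4
```

Because only TOP_K of N_EXPERTS experts execute per token, compute scales with the active parameters rather than the total parameter count, which is what makes very large sparse models comparatively cheap to train and serve.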
DeepSeek-R1 Preview achieves SOTA reasoning on LiveCodeBench, rivaling o1-Medium. https://t.co/Rm0ajEuO4D
People seem to be confused about what I imply. It's not braindead Whale Eats o3 boosterism. R1 (Preview) is ≈ o1 and imo ≈ R1; the major diff will be, again, in low-high test time compute settings. But: an open reasoner can be finetuned to specific hard agentic objectives. https://t.co/eM7CxvGPn6
On LiveCodeBench, DeepSeek-R1 scores somewhere between o1 low reasoning and o1 medium reasoning. Note that this looks to be full R1 and not the lite version from before. https://t.co/2N663zHG9E https://t.co/ZUdSnGdsQR