OpenAI Says Claude Is The Best: a new benchmark that measures how good LLMs are at doing scientific research puts Claude on top! Yes, Claude 3.5 and 3.7 Sonnet continue to top challenging real-world tasks https://t.co/D59bIokViL
New Benchmark alert: Humans are in the lead! OpenAI’s PaperBench just dropped, testing AI on replicating ICML 2024 research! Claude 3.5 Sonnet scores 21%, but humans top it. Open-sourced for all, it’s a game-changer for AI-driven research, automating tasks and boosting collab. https://t.co/9VFKpfDhWR
the moment OpenAI published PaperBench LMAO https://t.co/WcJCjedWPs https://t.co/r4wCP7tAtl
OpenAI has released PaperBench, a new benchmark designed to evaluate how well AI agents can replicate state-of-the-art AI research. The benchmark is built from 20 ICML 2024 research papers, broken down into 8,316 individually gradable tasks that test an agent's ability to understand a paper, write code, and execute experiments. The top-performing model was Claude 3.5 Sonnet (New) with open-source scaffolding, scoring 21.0%. That still falls well short of top ML PhDs, who scored 41.4% on the same tasks. The other models tested, including GPT-4o, Gemini 2.0 Flash, DeepSeek R1, and o1, were all outperformed by Claude 3.5 Sonnet. PaperBench is noted for being hard to game: it requires real-world understanding and execution, so memorization or "prompt tuning" is ineffective. The benchmark is part of OpenAI's Preparedness Framework, which assesses how ready AI models are for complex research tasks.
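To make the scoring concrete, here is a minimal sketch of how a PaperBench-style percentage could be aggregated from fine-grained task grades. The names (`TaskGrade`, `paper_score`, `benchmark_score`), the per-task weights, and the equal-weight averaging across papers are illustrative assumptions, not the benchmark's actual grading code, which may use a different (e.g. hierarchical) aggregation.

```python
from dataclasses import dataclass

# Hypothetical sketch: PaperBench splits 20 papers into 8,316 gradable tasks,
# and an agent's result is reported as a percentage. This shows one simple way
# such a percentage could be computed from per-task pass/fail grades.

@dataclass
class TaskGrade:
    task_id: str
    passed: bool          # did the agent's submission satisfy this task?
    weight: float = 1.0   # assumption: some tasks may count more than others

def paper_score(grades: list[TaskGrade]) -> float:
    """Weighted fraction of tasks satisfied for a single paper."""
    total = sum(g.weight for g in grades)
    earned = sum(g.weight for g in grades if g.passed)
    return earned / total if total else 0.0

def benchmark_score(papers: dict[str, list[TaskGrade]]) -> float:
    """Average the per-paper scores so each paper counts equally."""
    return sum(paper_score(g) for g in papers.values()) / len(papers)

# Example usage with a toy submission graded on two papers.
example = {
    "paper_a": [TaskGrade("a1", True), TaskGrade("a2", False), TaskGrade("a3", False)],
    "paper_b": [TaskGrade("b1", False), TaskGrade("b2", True), TaskGrade("b3", False), TaskGrade("b4", False)],
}
print(f"overall replication score: {benchmark_score(example):.1%}")
```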