OpenAI Says Claude Is The Best: a new benchmark that measures how good LLMs are at doing scientific research puts Claude on top! Yes, Claude 3.5 and 3.7 Sonnet continue to top challenging real-world tasks https://t.co/D59bIokViL
New Benchmark alert: Humans are in the lead! OpenAI’s PaperBench just dropped, testing AI on replicating ICML 2024 research! Claude 3.5 Sonnet scores 21%, but humans top it. Open-sourced for all, it’s a game-changer for AI-driven research, automating tasks and boosting collab. https://t.co/9VFKpfDhWR
the moment OpenAI published PaperBench LMAO https://t.co/WcJCjedWPs https://t.co/r4wCP7tAtl
OpenAI has released PaperBench, a new benchmark designed to evaluate how well AI agents can replicate state-of-the-art AI research. The benchmark is built from 20 ICML 2024 research papers, broken down into 8,316 individually gradable tasks that test an agent's ability to understand a paper, write code, and execute experiments. The top-performing model was Claude 3.5 Sonnet (New) with open-source scaffolding, scoring 21.0%. That still falls well short of top ML PhDs, who scored 41.4% on the same tasks. The other models tested, including GPT-4o, Gemini 2.0 Flash, DeepSeek R1, and o1, were all outperformed by Claude 3.5 Sonnet. PaperBench is noted for being hard to game: it requires real-world understanding and execution, so memorization or "prompt tuning" is ineffective. The benchmark is part of OpenAI's Preparedness Framework, which assesses how ready AI models are for complex research tasks.
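To make the scoring concrete, here is a minimal sketch of how a PaperBench-style percentage could be aggregated from fine-grained task grades. The names (`TaskGrade`, `paper_score`, `benchmark_score`), the per-task weights, and the equal-weight averaging across papers are illustrative assumptions, not the benchmark's actual grading code, which may use a different (e.g. hierarchical) aggregation.

```python
from dataclasses import dataclass

# Hypothetical sketch: PaperBench splits 20 papers into 8,316 gradable tasks,
# and an agent's result is reported as a percentage. This shows one simple way
# such a percentage could be computed from per-task pass/fail grades.

@dataclass
class TaskGrade:
    task_id: str
    passed: bool          # did the agent's submission satisfy this task?
    weight: float = 1.0   # assumption: some tasks may count more than others

def paper_score(grades: list[TaskGrade]) -> float:
    """Weighted fraction of tasks satisfied for a single paper."""
    total = sum(g.weight for g in grades)
    earned = sum(g.weight for g in grades if g.passed)
    return earned / total if total else 0.0

def benchmark_score(papers: dict[str, list[TaskGrade]]) -> float:
    """Average the per-paper scores so each paper counts equally."""
    return sum(paper_score(g) for g in papers.values()) / len(papers)

# Example usage with a toy submission graded on two papers.
example = {
    "paper_a": [TaskGrade("a1", True), TaskGrade("a2", False), TaskGrade("a3", False)],
    "paper_b": [TaskGrade("b1", False), TaskGrade("b2", True), TaskGrade("b3", False), TaskGrade("b4", False)],
}
print(f"overall replication score: {benchmark_score(example):.1%}")
```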