Great work by METR trying to benchmark how AI compares to human AI researchers at improving AI - a critical capability to monitor. https://t.co/NulKPRz1NY
Awesome work on AI R&D evals! https://t.co/BhBeuhLGIT
How close are current AI agents to automating AI R&D? Our new ML research engineering benchmark (RE-Bench) addresses this question by directly comparing frontier models such as Claude 3.5 Sonnet and o1-preview with 50+ human experts on 7 challenging research engineering tasks. https://t.co/woREKEWn5S
BALROG is a new benchmark for evaluating the agentic capabilities of large language models (LLMs) and vision-language models (VLMs) in games. Developed by a team including researchers from University College London (UCL), it addresses the saturation of existing benchmarks, which are showing diminishing returns as measures of AI progress. BALROG is designed to remain relevant over time by focusing on long-horizon interactive tasks drawn from reinforcement learning environments, part of a broader push in the research community toward more rigorous evaluation of agentic AI systems.
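For readers unfamiliar with how agentic game benchmarks of this kind are scored, the sketch below shows a generic agent-in-environment evaluation loop using the Gymnasium reset/step convention. It is a minimal illustration under stated assumptions: the `model_act` stub and the `FrozenLake-v1` environment are placeholders, not BALROG's actual harness, API, or task suite.

```python
# Illustrative sketch only: a generic agent-in-environment evaluation loop in the
# spirit of game benchmarks like BALROG. It follows the Gymnasium reset/step
# convention; `model_act` is a hypothetical stand-in for an LLM/VLM policy and is
# NOT BALROG's actual interface.
import gymnasium as gym

def model_act(observation, action_space) -> int:
    # Hypothetical placeholder: a real harness would format the observation into
    # a prompt, query the LLM/VLM, and parse its reply into a legal action.
    # Here we sample randomly so the sketch runs end to end.
    return action_space.sample()

def evaluate_episode(env_id: str = "FrozenLake-v1", max_steps: int = 200) -> float:
    env = gym.make(env_id)
    obs, _ = env.reset(seed=0)
    total_reward = 0.0
    for _ in range(max_steps):
        action = model_act(obs, env.action_space)
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += float(reward)
        if terminated or truncated:  # episode ends on success/failure or timeout
            break
    env.close()
    return total_reward  # per-episode score; benchmarks aggregate over many seeds and environments

if __name__ == "__main__":
    print(evaluate_episode())
```

In a real long-horizon benchmark, the per-episode scores above would be averaged across many environments and seeds to produce a single leaderboard number per model.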