Future House has released LAB-Bench, a comprehensive set of over 2,000 evaluations designed to test the ability of language models and agents to perform scientific research tasks in biology. Public models have generally underperformed PhD- and postdoc-level humans on nearly all tasks, with Claude 3.5 Sonnet emerging as the leading model, though it still leaves significant room for improvement. LAB-Bench is the first evaluation set aimed at assessing whether models can conduct scientific research rather than merely recall scientific trivia, a crucial step toward language models becoming genuine scientific collaborators. Separately, SciCode, a benchmark that challenges language models to code solutions to scientific problems drawn from advanced papers, shows GPT-4 and Claude 3.5 Sonnet achieving less than 5% accuracy; about 10% of its challenges are based on Nobel Prize-winning research.
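For context on how a multiple-choice eval set in this style is typically scored, here is a minimal sketch, assuming a hypothetical `ask_model` function and a simple list-of-dicts question format; it is not the actual LAB-Bench harness, just an illustration of the accuracy metric behind the model-vs-human comparisons above.

```python
# Minimal sketch of scoring a multiple-choice eval (hypothetical format,
# not the official LAB-Bench harness).
from typing import Callable


def score_multiple_choice(
    questions: list[dict],  # each: {"question": str, "choices": list[str], "answer": str}
    ask_model: Callable[[str, list[str]], str],  # hypothetical: returns the chosen option text
) -> float:
    """Return the fraction of questions the model answers correctly."""
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        prediction = ask_model(q["question"], q["choices"])
        if prediction.strip() == q["answer"].strip():
            correct += 1
    return correct / len(questions)


if __name__ == "__main__":
    # Toy example with a made-up question; real LAB-Bench tasks involve
    # figures, protocols, sequences, database lookups, and literature reasoning.
    demo = [{
        "question": "Which base pairs with adenine in DNA?",
        "choices": ["Cytosine", "Guanine", "Thymine", "Uracil"],
        "answer": "Thymine",
    }]
    always_thymine = lambda question, choices: "Thymine"
    print(f"accuracy = {score_multiple_choice(demo, always_thymine):.2%}")
```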
Robust evals are the first step to making real progress on LLMs becoming better scientific collaborators for human scientists. Glad to see Future House release this set. If the rate of progress against other evals is any sign, buckle up! https://t.co/dlb9CjDt75
LAB-Bench is, to my knowledge, the first set of evals designed to measure whether models/agents can do scientific research, not just whether they know assorted trivia about science. Procedural evaluations like this for complex tasks will be very important going forward. https://t.co/PG3yV7E3Qw
Really important work on evals for AI in real-world biology tasks👇 https://t.co/GwdVvj4L2J