OpenAI has open-sourced BrowseComp, a benchmark for browser agents designed to test AI agents' ability to browse the internet and find hard-to-locate information. https://t.co/7JGO7PWqau
OpenAI Open Sources BrowseComp: A New Benchmark for Measuring the Ability of AI Agents to Browse the Web https://t.co/8u0ant0Ktj
OpenAI has announced the open-sourcing of BrowseComp, a new benchmark designed to evaluate how well AI agents can browse the internet for difficult-to-find information. The benchmark, short for "Browsing Competition", consists of 1,266 short-answer questions. Initial results show that general-purpose models with browsing capabilities, such as GPT-4.5 and GPT-4o, achieved less than 2 percent accuracy on the benchmark. In contrast, Deep Research, a specialized model trained specifically for this kind of task, reached 51.5 percent accuracy. OpenAI intends BrowseComp to serve as a challenging arena for AI agents, akin to competitive programming or math contests, and thereby to sharpen the evaluation of browsing intelligence.
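To give a concrete sense of how a benchmark like this runs, here is a minimal sketch of a BrowseComp-style evaluation loop in Python. The `ask_agent` helper, the model name, and the exact-match scoring are illustrative assumptions, not OpenAI's actual harness; the real benchmark ships with OpenAI's evaluation tooling and grades free-form answers with a model-based grader rather than strict string comparison.

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def ask_agent(question: str) -> str:
    """Query a model for a short answer (hypothetical setup; a real run
    would use a browsing-capable agent, not a bare chat completion)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; swap in any browsing-capable agent
        messages=[
            {"role": "system", "content": "Reply with a single short answer."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()


def evaluate(pairs: list[tuple[str, str]]) -> float:
    """Accuracy over (question, reference answer) pairs.
    Exact match is a simplification of the benchmark's grading."""
    correct = sum(ask_agent(q).lower() == ref.lower() for q, ref in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    # Two toy items standing in for the benchmark's 1,266 real questions.
    sample = [
        ("What year was the Eiffel Tower completed?", "1889"),
        ("Who wrote 'The Selfish Gene'?", "Richard Dawkins"),
    ]
    print(f"accuracy: {evaluate(sample):.1%}")
```

The single-number accuracy metric is what makes the reported gap so stark: below 2 percent for general-purpose models versus 51.5 percent for an agent trained to browse persistently.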