
BigCodeBench, a new benchmark designed to evaluate large language models (LLMs) on practical and challenging programming tasks, has been introduced. Unlike simpler benchmarks such as HumanEval and MBPP, BigCodeBench tests LLMs, including open code and math models, on more realistic and comprehensive coding scenarios. The initiative, led by Terry Yue Zhuo, aims to address the saturation of basic coding benchmarks by state-of-the-art (SOTA) LLMs. Current top models, including GPT-4 and recent models from DeepSeek AI, achieve only around 50% success on these tasks, which require composing calls across a wide range of tools and libraries, and roughly 40% of tasks remain unsolved by SOTA models, underscoring the need for more robust evaluation standards in AI and machine learning.
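To illustrate what "composing calls across a wide range of libraries" means in practice, here is a minimal sketch of a task in the style of BigCodeBench. The function name, arguments, and chosen libraries are illustrative assumptions, not an actual benchmark item; the point is that solving such a task requires chaining several third-party APIs rather than writing standard-library-only logic.

```python
# Hypothetical task in the BigCodeBench style (illustrative only):
# load a CSV, clean one column, save a histogram, and return the cleaned data.
import pandas as pd
import matplotlib.pyplot as plt


def task_func(csv_path: str, column: str, out_png: str) -> pd.DataFrame:
    """Read `csv_path`, drop rows with missing values in `column`,
    write a histogram of that column to `out_png`, and return the
    cleaned DataFrame."""
    df = pd.read_csv(csv_path)
    cleaned = df.dropna(subset=[column])

    fig, ax = plt.subplots()
    ax.hist(cleaned[column])
    ax.set_xlabel(column)
    ax.set_ylabel("count")
    fig.savefig(out_png)
    plt.close(fig)

    return cleaned
```

Grading a solution like this involves checking both the returned value and the side effect (the saved plot), which is part of what makes these tasks harder to pass than single-function benchmarks.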
These benchmarks are impressive - Open Source LLMs are retaking serious ground! https://t.co/1aUO781DdW
Reasonable evaluation must be the basis for the healthy development of LLMs, and I believe 🌸BigCodeBench will help us trace the realistic performance of Code LLMs very well! Thanks to @terryyuezhuo for leading this excellent work! https://t.co/uuUElLeRZq
LLMs are making progress, and so are the benchmarks! BigCodeBench is a new coding benchmark with comprehensive real-life tasks (e.g., going beyond the standard library). About 40% of tasks are still not solved by SOTA models. Awesome work led by @terryyuezhuo ♥️ https://t.co/gRWconIhpw
