
The Open LLM Leaderboard 2 has been released, introducing new benchmarks and features to evaluate large language models (LLMs) more effectively. This update addresses the plateauing scores of LLMs on older benchmarks by adopting new evaluations: IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-Pro. The leaderboard now features higher-quality datasets, chat-template support, and a community voting system to prioritize which models get evaluated. Qwen 72B Instruct currently leads the rankings. The update also emphasizes fairer, more transparent, and reproducible comparisons of LLMs, with new visualizations and published technical details. The leaderboard is available on the Hugging Face Hub; 300 H100 GPUs were used to re-run the new evaluations.
Fabulous talk today by @BorisMPower of @OpenAI at @Yale @yaledatascience @YINSedge on “ChatGPT and the Future of LLMs.” The developments are mind-blowing. #HNL https://t.co/tnY4c6USj9
Big news! The open llm leaderboard will be hard to game for a couple weeks! Looking forward to checking out Leaderboard 3 but for now, I'm choosing models based on my use case with MyxMatch Find fitness for free: https://t.co/uu8qp62QBB https://t.co/xVITCi3qq6 https://t.co/BJj2MoHlL2
Pumped to announce the brand new open LLM leaderboard. We burned 300 H100 to re-run new evaluations like MMLU-pro for all major open LLMs! Some learning: - Qwen 72B is the king and Chinese open models are dominating overall - Previous evaluations have become too easy for recent…