The MMLU-Pro CS benchmark results have been released, comparing the performance of various language models. After 59 runs across 25 models over 70 hours, Alibaba's QwQ-32B-Preview emerged as the top local model. It scored 70.03 on MMLU-PRO. Several new models were also added to the leaderboard, including one that took the overall first place in the 70+B category with an average score of 52.02 and another that ranked 81st in the 35B category with an average score of 36.2. Other additions fall in the 1.5B, 13B, and 7B categories, with benchmark averages between 9.9 and 29.58 across IFEval, BBH, MATH Lvl 5, GPQA, MUSR, and MMLU-PRO. The steady stream of new entries reflects how competitive large language model benchmarking has become.
New model added to the leaderboard!
Model Name https://t.co/cK4Td5iwbB
Overall rank: 1783
Rank in 1.5B category: 106
Benchmarks Average: 9.9
IFEval: 17.29, BBH: 14.04, MATH Lvl 5: 7.7, GPQA: 0.89, MUSR: 4.61, MMLU-PRO: 14.88

New model added to the leaderboard!
Model Name https://t.co/nCtHurkASY
Overall rank: 1276
Rank in 13B category: 314
Benchmarks Average: 18.16
IFEval: 30.53, BBH: 34.22, MATH Lvl 5: 8.69, GPQA: 2.13, MUSR: 10.91, MMLU-PRO: 22.51

New model added to the leaderboard!
Model Name https://t.co/hBiispoLzD
Overall rank: 300
Rank in 7B category: 49
Benchmarks Average: 29.58
IFEval: 78.87, BBH: 32.11, MATH Lvl 5: 16.62, GPQA: 9.62, MUSR: 9.11, MMLU-PRO: 31.16
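
The posts do not say how "Benchmarks Average" is derived. A minimal sketch, assuming it is the unweighted mean of the six listed benchmark scores, checks that assumption against the three posts above. The `posts` dictionary and the `mean_gap` helper are illustrative names, not part of any leaderboard tooling, and the numbers are copied from the posts.

```python
from statistics import mean

# Scores copied from the three posts, in the order
# IFEval, BBH, MATH Lvl 5, GPQA, MUSR, MMLU-PRO.
posts = {
    "1.5B": {"reported": 9.9, "scores": [17.29, 14.04, 7.7, 0.89, 4.61, 14.88]},
    "13B": {"reported": 18.16, "scores": [30.53, 34.22, 8.69, 2.13, 10.91, 22.51]},
    "7B": {"reported": 29.58, "scores": [78.87, 32.11, 16.62, 9.62, 9.11, 31.16]},
}


def mean_gap(post):
    """Absolute gap between the unweighted mean and the posted average.

    Assumption: the leaderboard average is a plain arithmetic mean.
    """
    return abs(mean(post["scores"]) - post["reported"])


for category, post in posts.items():
    print(f"{category}: mean={mean(post['scores']):.3f}, "
          f"reported={post['reported']}, gap={mean_gap(post):.3f}")
```

For all three posts the gap stays below 0.01, which is consistent with a plain mean computed from unrounded scores and then rounded for display. It does not rule out other weighting schemes.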