Recent evaluations of large language models (LLMs) highlight a competitive landscape among the leading systems. Sonnet 3.5 is widely regarded as the best coding LLM, ahead of rivals such as Qwen and o1-mini, which are competitive but not at the same level. On one benchmark, however, o1-preview scored 71.4%, o1-mini scored 52.6%, and Sonnet 3.5 trailed at 33.4%. Meanwhile, QwQ, a new open-source 32B model, has shown impressive reasoning capability, ranking second overall behind only the o1 line. On pricing, Google's offerings, including the Gemini-Exp models, stand out in price/performance comparisons. Overall, the evaluations point to sizable performance gaps among the top models, with Chinese open-weight LLMs also making notable strides in mathematics.
These are the BEST LLMs for Mathematics according to LiveBench! Chinese open-weights LLMs are dominating in mathematics! Qwen 2.5 32B Coder, QwQ, and Qwen 2.5 72B are all at the forefront of efficiency! Gemini Exp 1206, however, dominates them in terms of raw performance and even… https://t.co/7iiQWAP40L
So LMSYS is COOKED - Let's look at LiveBench price to performance instead! I just pulled all the LiveBench data and stacked top LLMs on semi-log cost vs. performance plots! Google's Gemma-2, Gemini 1.5 Flash (& 8B), and the new Gemini-EXP-1206 (possibly Gemini 2.0 Pro) are ALL… https://t.co/tUpsBfuaze
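The semi-log cost vs. performance comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual analysis: the model names, per-million-token costs, and scores below are hypothetical placeholders, not real LiveBench or pricing data. The idea is simply to put cost on a log10 axis (so cheap and expensive models fit on one readable scale) and rank models by a crude score-per-dollar ratio.

```python
import math

# Hypothetical placeholder numbers -- NOT actual LiveBench scores or API prices.
models = {
    "model-a": {"cost_per_mtok": 15.00, "score": 71.4},
    "model-b": {"cost_per_mtok": 3.00,  "score": 52.6},
    "model-c": {"cost_per_mtok": 0.15,  "score": 33.4},
}

# Semi-log layout: x = log10(cost), y = raw benchmark score.
points = {
    name: (math.log10(m["cost_per_mtok"]), m["score"])
    for name, m in models.items()
}

# Rank by score per dollar -- a crude price/performance proxy.
ranked = sorted(
    models,
    key=lambda n: models[n]["score"] / models[n]["cost_per_mtok"],
    reverse=True,
)
print(ranked)  # cheapest-per-point model first
```

With these placeholder numbers the cheap low-scoring model wins on score-per-dollar while the expensive frontier model wins on raw score, which is exactly the trade-off a semi-log cost/performance plot makes visible.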
I still think Sonnet 3.5 and the o1 models are a step above the Gemini-Exp models, because LMSYS honestly sucks as a benchmark - it's mostly a benchmark of formatting and people-pleasing skills. I mean, Anthropic has basically completely abandoned this benchmark. They only provide the… https://t.co/BSjG3qOlGB