Recent developments in large language models (LLMs) have seen significant advancements from major players such as Meta, Google, Anthropic, and OpenAI. Meta's Llama 3.1 405B has achieved state-of-the-art (SOTA) results on MMLU-Pro, surpassing Anthropic's Claude 3.5 Sonnet, and its zero-shot MATH score falls between Claude 3.5 Sonnet and OpenAI's GPT-4o. Meanwhile, Google's Gemini 1.5 Pro has outperformed its predecessor, Gemini 1.0 Ultra. These results suggest that architectural and data tweaks have led to substantial performance gains across the board.
Llama 3.1 Instruct-tune evaluations are, uh, interesting. Any benchmarking should be taken with perhaps copious amounts of salt, but: - 405B is SOTA on MMLU-Pro, replacing 3.5 Sonnet - zero-shot MATH lands somewhere in between 3.5 Sonnet and 4o. https://t.co/n1tiDnxeL2
Will Llama 3 405b rank higher than Claude 3.5 Sonnet and GPT-4o in the lmsys arena leaderboard?
Feels like Meta, Google, Anthropic, and OpenAI all found (the same?) architectural/data tweaks that made big differences. Hence 3.5 Sonnet beating 3 Opus, Gemini 1.5 Pro beating 1.0 Ultra, and now Llama 3.1 https://t.co/czdAKmK2bF