Llama 3 405B, a language model, posted strong benchmark results that were reported four months ago. At that point it scored 86 on MMLU, compared with Claude 3.5 Sonnet's 88 today. On BIG-Bench Hard, Llama scored 85 while Sonnet scored 93, and on DROP, Llama scored 83 while Sonnet scored 87. Because the Llama figures are four months old, some speculate that Llama 3 405B could have surpassed Claude 3.5 Sonnet if it kept improving. In human evaluations, however, Llama 3 405B appears to be significantly behind GPT-4o. How Llama 3 405B will compare with Claude 3.5 Sonnet and OpenAI's GPT-4o, particularly on the LMSYS Arena leaderboard, remains an open question.
Will Llama 3 405B rank higher than Claude 3.5 Sonnet and GPT-4o on the LMSYS Arena leaderboard?
It seems that Llama-3-405b is significantly behind GPT-4o in human evaluations. https://t.co/Qkem851FeU
If Llama 3 is this good, and if Claude 3.5 Sonnet is this good... think about how good GPT-5 is going to be.