Apr 7, 10:00 AM

Meta's Llama 4 Behemoth Trails Google's Gemini 2.5 Pro, Which Scores 72.9% on the Aider Coding Benchmark

Meta's Llama 4 series, which includes the Maverick and Scout models, has shown mixed benchmark results. On STEM-focused benchmarks, Llama 4 Behemoth reportedly outperforms GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro, though comparisons with Gemini 2.5 Pro remain unreported. Maverick averaged 55.8% on BigCodeBench-Full, comparable to GPT-4o-2024-05-13 and DeepSeek V3, but underperformed on coding-specific benchmarks such as Aider and Kscores. Llama 4 Scout ranked 97th out of 192 models on BigCodeBench-Hard.

Scout and Maverick delivered strong inference speeds on NVIDIA hardware and Apple's M3 Ultra: up to 42,000 tokens per second with TensorRT-LLM on Blackwell B200, and 50 tokens per second on an M3 Ultra.

Their coding performance nonetheless lagged behind competitors, notably Google's Gemini 2.5 Pro, which scores 72.9% on the Aider benchmark and offers a 1 million token context window. Gemini 2.5 Pro has been recognized for its coding capabilities, outperforming Claude Sonnet and OpenAI's GPT models on multiple benchmarks, and it has become a prominent choice for real-world code editing and refactoring tasks. Some users, however, have noted that newer Gemini models lack a 2 million token context window, so they must fall back on older versions such as Gemini 1.5 Pro for that capability.