Meta's Llama 4 series, which includes the Scout and Maverick models, has shown mixed results across benchmarks. On STEM-focused benchmarks, Llama 4 Behemoth reportedly outperforms GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro, though no comparison against Gemini 2.5 Pro has been reported. Maverick averaged 55.8% on BigCodeBench-Full, comparable to GPT-4o-2024-05-13 and DeepSeek V3, but underperformed on coding-specific benchmarks such as Aider and Kscores, and Scout ranked 97th out of 192 models on BigCodeBench-Hard. Inference speed was a stronger point: Scout and Maverick reached up to 42,000 tokens per second on TensorRT-LLM with NVIDIA's Blackwell B200 and 50 tokens per second on Apple's M3 Ultra. Coding performance nonetheless lagged behind competitors, notably Google's Gemini 2.5 Pro, which scored 72.9% on the Aider benchmark and offers a 1 million token context window. Gemini 2.5 Pro has outperformed Claude Sonnet and OpenAI's GPT models on multiple coding benchmarks and has become a prominent choice for real-world code editing and refactoring. Some users note, however, that newer Gemini models lack a 2 million token context window, so that feature still requires an older version such as Gemini 1.5 Pro.
Llama 4 Maverick is a low performer on the Aider coding benchmark. This is a good benchmark to track: it tests models on a range of popular programming languages (not just the usual Python). Gemini 2.5 Pro is the clear leader here. Maverick is on the third page of this table! https://t.co/5t0RrV0BPc https://t.co/SpRZ6z3LvU
The latest Aider polyglot benchmark just dropped, testing LLMs not just on code generation—but real-world code editing and refactoring. Topping the charts: 🔹 Gemini 2.5 Pro with 72.9% accuracy 🔹 Claude 3.7 Sonnet and DeepSeek R1 combos also showed serious strength 🔹 https://t.co/wOmZsd4WLc
Google’s AI Studio is seriously impressive — and free. You can even set system instructions. The only downside? Their newest models don’t support 2M context yet - you’ll need Gemini 1.5 Pro for that.