
Recent evaluations of AI models have highlighted the performance of Gemini 1.5 Pro 0827, which scored 67% on Aider's code editing benchmark, just above Llama 405b at 66%. By comparison, Sonnet led with 77%, while GPT 3.5 Turbo 0301 and Gemini 1.5 Flash 0827 followed at 58% and 53%, respectively. Google has also released another fine-tuned version of Gemini, which reportedly brings only minor improvements. In a separate assessment of structured output capabilities, OpenAI's GPT-4o was rated best thanks to its direct Pydantic integration, Claude 3.5 came second, requiring a 'tool call' trick for optimal results, and Gemini 1.5 was rated merely 'OK' (both approaches are sketched below). Finally, in style-controlled evaluations, Gemini 1.5 Flash was noted to outperform GPT-4o-mini overall and in every category except coding.
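As a rough illustration of what "direct Pydantic integration" looks like in practice, here is a minimal sketch using the OpenAI Python SDK's structured-output parsing. The model name, schema, and prompt are illustrative placeholders, not details from the reports above.

```python
# Minimal sketch: structured output via direct Pydantic integration
# (OpenAI Python SDK). Schema and prompt are illustrative placeholders.
from pydantic import BaseModel
from openai import OpenAI


class TicketTriage(BaseModel):
    category: str
    priority: int
    summary: str


client = OpenAI()

# The SDK accepts a Pydantic model as response_format and returns a
# parsed instance, so no manual JSON extraction or validation is needed.
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Triage incoming support tickets."},
        {"role": "user", "content": "Checkout page returns a 500 error for all users."},
    ],
    response_format=TicketTriage,
)

triage = completion.choices[0].message.parsed  # a TicketTriage instance
print(triage.category, triage.priority)
```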


Nice work on controlling style biases! In this view, many models are no longer inflated (e.g., response length, formatting). Gemini 1.5 Flash also outperforms gpt-4o-mini overall and across all categories except for coding. https://t.co/KPUpCgBdm2
🦉 Are you working to improve your LLM-based analytics or building LLM agents? We tested structured output in Gemini Pro, Claude, and GPT. Results: 🥇 OpenAI GPT-4o: Best. Direct Pydantic integration. 🥈 Claude 3.5: Good. Needs 'tool call' trick. 🥉 Gemini 1.5: OK. Clunky… https://t.co/usS6uTTjFG
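For comparison, the 'tool call' trick mentioned for Claude 3.5 is roughly the pattern below: declare a single tool whose input_schema is the Pydantic model's JSON schema, force the model to call it, then validate the tool input back into the model. This is a hedged sketch assuming the Anthropic Python SDK; the tool name, model version, and prompt are illustrative, not taken from the tweet.

```python
# Sketch of the 'tool call' trick for structured output with Claude
# (Anthropic Python SDK). Tool name, model, and prompt are placeholders.
import anthropic
from pydantic import BaseModel


class TicketTriage(BaseModel):
    category: str
    priority: int
    summary: str


client = anthropic.Anthropic()

# Expose the Pydantic schema as a tool and force Claude to call it,
# so the response arrives as structured tool input rather than free text.
triage_tool = {
    "name": "record_triage",
    "description": "Record the triage decision for a support ticket.",
    "input_schema": TicketTriage.model_json_schema(),
}

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    tools=[triage_tool],
    tool_choice={"type": "tool", "name": "record_triage"},
    messages=[
        {"role": "user", "content": "Triage: checkout page returns a 500 error for all users."},
    ],
)

# The forced tool call comes back as a tool_use content block; its input
# follows the schema and can be validated into the Pydantic model.
tool_use = next(block for block in message.content if block.type == "tool_use")
triage = TicketTriage.model_validate(tool_use.input)
print(triage.category, triage.priority)
```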