Salesforce researchers have released MCP-Universe, a new benchmark that evaluates how well large-language-model agents perform practical tasks that require calling external applications, fetching live data and taking sequential actions. The testbed is built on the open-source Model Context Protocol, which links agents to real servers across six tool categories.

In initial results, OpenAI’s GPT-5 topped the leaderboard with a 43.72% success rate, followed by xAI’s Grok-4 at 33.33%, Anthropic’s Claude-4 Sonnet at 29.44% and Google’s Gemini-2.5 Pro at 22.08%. The figures suggest that even the best available systems fail more than half of the time when asked to complete end-to-end, enterprise-style workflows.

Salesforce says the benchmark is intended to give developers and corporate buyers a reproducible way to measure real-world reliability rather than relying on synthetic academic tests. The company plans to update the suite as new models and enterprise tools emerge. The findings underscore the gap between rapid improvements in language understanding and the robustness required for production deployments, reinforcing industry guidance that organisations rigorously test agentic AI systems before integrating them into critical operations.
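For readers unfamiliar with what “calling external applications” over the Model Context Protocol looks like, the sketch below shows a minimal client session using the official MCP Python SDK. It is illustrative only and is not taken from the benchmark: the local `server.py` script and the `get_weather` tool are assumed names for the sake of the example.

```python
# Minimal MCP client sketch (assumptions: the `mcp` Python SDK is installed and a
# local MCP server script named "server.py" exists; the tool name is hypothetical).
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch a local MCP server as a subprocess and talk to it over stdio.
    params = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover which tools the server exposes...
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # ...then invoke one of them, much as an agent would mid-task.
            result = await session.call_tool("get_weather", arguments={"city": "Berlin"})
            print(result.content)


asyncio.run(main())
```

An agent under evaluation repeats this discover-and-call loop across several servers and steps, which is what makes the end-to-end workflows in the benchmark harder than single-turn question answering.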