Salesforce researchers have released MCP-Universe, a new benchmark that evaluates how well large-language-model agents perform practical tasks that require calling external applications, fetching live data and taking sequential actions. The testbed is built on the open-source Model Context Protocol, which links agents to real servers across six tool categories.

In initial results, OpenAI’s GPT-5 topped the leaderboard with a 43.72% success rate, followed by xAI’s Grok-4 at 33.33%, Anthropic’s Claude-4 Sonnet at 29.44% and Google’s Gemini-2.5 Pro at 22.08%. The figures suggest that even the best available systems fail more than half of the time when asked to complete end-to-end, enterprise-style workflows.

Salesforce says the benchmark is intended to give developers and corporate buyers a reproducible way to measure real-world reliability rather than relying on synthetic academic tests. The company plans to update the suite as new models and enterprise tools emerge. The findings underscore the gap between rapid improvements in language understanding and the robustness required for production deployments, reinforcing industry guidance that organisations rigorously test agentic AI systems before integrating them into critical operations.
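For readers unfamiliar with what “calling external applications” over the Model Context Protocol looks like, the sketch below shows a minimal client session using the official MCP Python SDK. It is illustrative only and is not taken from the benchmark: the local `server.py` script and the `get_weather` tool are assumed names for the sake of the example.

```python
# Minimal MCP client sketch (assumptions: the `mcp` Python SDK is installed and a
# local MCP server script named "server.py" exists; the tool name is hypothetical).
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch a local MCP server as a subprocess and talk to it over stdio.
    params = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover which tools the server exposes...
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # ...then invoke one of them, much as an agent would mid-task.
            result = await session.call_tool("get_weather", arguments={"city": "Berlin"})
            print(result.content)


asyncio.run(main())
```

An agent under evaluation repeats this discover-and-call loop across several servers and steps, which is what makes the end-to-end workflows in the benchmark harder than single-turn question answering.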