Microsoft has launched OmniParser V2, an advanced screen parsing tool that enhances the capabilities of large language models (LLMs) by enabling them to interact with computer screens. The new version is reported to be 60% faster than its predecessor, OmniParser V1, achieving sub-second latency on high-performance graphics cards such as the NVIDIA GeForce RTX 4090. OmniParser V2 converts UI screenshots into structured data, allowing models such as GPT-4, DeepSeek R1, and Sonnet 3.5 to understand and act upon the information displayed on screen. The tool is open source and available under the MIT license, making it straightforward to integrate with a variety of models and agents. It also supports multiple platforms, including Windows, macOS, Android, iOS, and web applications, broadening its applicability in web automation and AI-driven tasks.
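To make the "screenshots into structured data" idea concrete, here is a minimal illustrative sketch of the kind of element list a screen parser like OmniParser V2 might emit, and how it could be serialized into a text prompt for an LLM agent. All names here (`UIElement`, `elements_to_prompt`, the sample elements) are hypothetical for illustration and do not reflect the actual OmniParser API.

```python
from dataclasses import dataclass

# Hypothetical representation of one parsed UI element: a bounding box,
# a type, a caption, and whether an agent could act on it. This mirrors
# the general idea of structured screen parsing, not OmniParser's real
# output schema.
@dataclass
class UIElement:
    elem_type: str     # e.g. "button", "text", "icon"
    caption: str       # description of what the element shows
    bbox: tuple        # (x1, y1, x2, y2) in pixel coordinates
    interactable: bool # can an agent click/type here?

def elements_to_prompt(elements):
    """Serialize parsed elements into a compact, numbered text block
    that a downstream LLM can reason over when choosing an action."""
    lines = []
    for i, e in enumerate(elements):
        flag = "interactable" if e.interactable else "static"
        lines.append(f"[{i}] {e.elem_type} ({flag}) at {e.bbox}: {e.caption}")
    return "\n".join(lines)

# Example: two elements a parser might extract from a login screen.
parsed = [
    UIElement("button", "Sign in button", (420, 610, 520, 650), True),
    UIElement("text", "Welcome back heading", (80, 40, 400, 90), False),
]
print(elements_to_prompt(parsed))
```

An agent loop would feed this text, together with the user's goal, to the LLM and map the model's chosen element index back to the bounding box to execute a click or keystroke.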
Recently in #AI+#Robotics: Microrobots and the 'lazy agent problem': Swarm study demonstrates a solution https://t.co/gWGI2xfiMy
Microsoft AI Releases OmniParser V2: An AI Tool that Turns Any LLM into a Computer Use Agent #OmniParserV2 #AIandGUI #MicrosoftAI #LLMdevelopment #TechInnovation https://t.co/9h4yMVogEe https://t.co/v8BBPkvv5X
Try out OmniParser-v2.0 https://t.co/Fyr7pATOQd https://t.co/1ItQ76CgFC https://t.co/3Wt2JKF0fM