OpenAI has taken its Realtime API out of beta and made it generally available, adding support for remote Model Context Protocol servers, image inputs and Session Initiation Protocol phone calling. The company also introduced gpt-realtime, described as its most advanced speech-to-speech model, with two new voices, Cedar and Marin.

Internal evaluations show the model scoring 82.8% on BigBench Audio reasoning tasks and 30.5% on MultiChallenge instruction-following, sharp improvements over the December 2024 release. Pricing has been reduced by 20% to $32 per million audio input tokens and $64 per million output tokens.

The Realtime API now processes audio in a single pass instead of chaining speech-to-text and text-to-speech components, lowering latency for production voice agents. OpenAI says the model better adheres to developer instructions, captures non-verbal cues and can switch languages mid-sentence, while reusable prompts aim to simplify deployment at scale.

The launch comes the same day Microsoft unveiled its first in-house AI systems, including MAI-Voice-1, a speech engine that can generate a minute of audio in under a second on one GPU, and MAI-1-preview, a text model trained on about 15,000 Nvidia H100 chips. The parallel releases underscore intensifying competition between the long-time partners as they race to supply faster and cheaper voice and multimodal capabilities to developers.
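At the announced rates, per-session costs are easy to estimate. The snippet below is an illustrative sketch using only the prices stated above; the helper function name and the example token counts are hypothetical, not part of the announcement:

```python
# Rough cost estimate at the announced GA rates:
# $32 per million audio input tokens, $64 per million output tokens.
INPUT_RATE = 32.00 / 1_000_000   # USD per audio input token
OUTPUT_RATE = 64.00 / 1_000_000  # USD per audio output token

def estimate_session_cost(input_tokens: int, output_tokens: int) -> float:
    """Approximate USD cost of one voice session (hypothetical helper)."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a session consuming 50,000 input and 20,000 output audio tokens.
print(round(estimate_session_cost(50_000, 20_000), 2))  # -> 2.88
```

Actual billing may differ (e.g. cached-input discounts or separate text-token rates), so treat this as a back-of-the-envelope check only.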
OpenAI and Microsoft debut new voice models https://t.co/qSXsAj1JZj
Microsoft AI has launched its first in-house AI models. Microsoft’s complicated partnership with OpenAI is adding a new twist as it releases AI models that will help power Copilot features https://t.co/XqhoGgRtkP
Microsoft launches its first in-house AI models https://t.co/gyc75m5pEn