OpenAI launches three new real-time audio API models

OpenAI has launched three new real-time audio models through its API, pushing voice AI from basic question-and-answer interactions toward agents that can listen, reason, translate, and act within a single live conversation. The release also marks the Realtime API's exit from beta, making it generally available for production use for the first time.
At the center of the release is GPT-Realtime-2, OpenAI's first voice model built on GPT-5-class reasoning. Most voice systems rely on a cascaded pipeline that transcribes speech, generates a text reply, and then synthesizes audio. GPT-Realtime-2 instead processes audio in a continuous stream, allowing it to interpret speech as it happens and respond without the gap caused by separate transcription and synthesis stages. The model supports a 128K token context window, up from 32K in the previous version, which makes longer voice sessions and complex multi-step agentic flows practical without external memory scaffolding.
What GPT-Realtime-2 can do
The model is built specifically for what OpenAI calls "agentic behavior" during voice calls. Preambles let it say "Let me check that" or "One moment" while it executes tool calls, so users are not left with dead air. Parallel tool calls let it run multiple back-end requests simultaneously and narrate which one is in flight. Stronger recovery behavior means it handles failures out loud rather than freezing mid-conversation. Tone adjustment lets it shift between styles based on context: more measured for support calls and more upbeat for confirmations.
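The preamble, parallel tool call, and spoken recovery pattern described above can be sketched in ordinary application code. This is only an illustration of the control flow, not the Realtime API itself: the functions `lookup_order`, `lookup_refund_policy`, and `handle_turn`, the `say` callback, and the sample data are all hypothetical names invented for this sketch.

```python
import asyncio


async def lookup_order(order_id: str) -> dict:
    # Hypothetical back-end call; the sleep stands in for network latency.
    await asyncio.sleep(0.05)
    return {"order_id": order_id, "status": "shipped"}


async def lookup_refund_policy(region: str) -> dict:
    # Second hypothetical back-end call, independent of the first.
    await asyncio.sleep(0.05)
    return {"region": region, "window_days": 30}


async def handle_turn(say) -> dict:
    # Preamble: speak right away so the caller never hears dead air.
    say("Let me check that.")
    # Parallel tool calls: both requests are in flight at the same time.
    order, policy = await asyncio.gather(
        lookup_order("A123"),
        lookup_refund_policy("EU"),
        return_exceptions=True,
    )
    # Recovery: report a failed tool out loud instead of going quiet.
    for name, result in (("order lookup", order), ("policy lookup", policy)):
        if isinstance(result, Exception):
            say(f"I couldn't complete the {name}, so I'll try another way.")
    return {"order": order, "policy": policy}


# Collect the "spoken" lines in a list instead of synthesizing audio.
spoken = []
outcome = asyncio.run(handle_turn(spoken.append))
```

Passing `return_exceptions=True` to `asyncio.gather` is what keeps one failed request from cancelling the whole turn, which is the property the recovery behavior depends on.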
GPT-Realtime-2 scores 15.2% higher than GPT-Realtime-1.5 on Big Bench Audio, OpenAI's audio reasoning benchmark, and 13.8% higher on Audio Multichallenger for instruction following. In real-world testing, Zillow reports a 26-point lift in call success rate on its hardest adversarial benchmark, from 69% to 95%, after prompt optimization on GPT-Realtime-2. The model is priced at $32 per million audio input tokens, $64 per million audio output tokens, and $0.40 per million cached input tokens.
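To make the per-token pricing concrete, here is a minimal cost estimate using the three rates quoted above. The function name `session_cost` and the token counts in the example are assumptions chosen for illustration; real sessions will have different token mixes, and how audio duration maps to tokens is not stated here.

```python
from decimal import Decimal

# USD per one million tokens, as quoted in the article.
PRICE_PER_MILLION = {
    "audio_input": Decimal("32"),
    "audio_output": Decimal("64"),
    "cached_input": Decimal("0.40"),
}


def session_cost(audio_input: int, audio_output: int, cached_input: int = 0) -> Decimal:
    """Estimate the USD cost of one session from its token counts."""
    counts = {
        "audio_input": audio_input,
        "audio_output": audio_output,
        "cached_input": cached_input,
    }
    total = sum(PRICE_PER_MILLION[kind] * n for kind, n in counts.items())
    return total / Decimal(1_000_000)


# Hypothetical session: 50K input, 20K output, 100K cached input tokens.
example = session_cost(50_000, 20_000, cached_input=100_000)
print(example)
```

With those made-up counts the estimate is $1.60 for input, $1.28 for output, and $0.04 for cached input, or $2.92 in total. `Decimal` is used so that the currency arithmetic does not pick up floating-point rounding noise.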
GPT-Realtime-Translate and GPT-Realtime-Whisper
The second model, GPT-Realtime-Translate, is a dedicated live speech translation system. It processes spoken input continuously and outputs translations in real time without requiring speakers to pause or finish complete sentences. The model supports more than 70 input languages and 13 output languages, targeting customer support, education, live events, and cross-border sales environments. BolnaAI, a voice AI company building for Indian language markets, reports 12.5% lower word error rates on Hindi, Tamil, and Telugu compared to the previous translation approach. GPT-Realtime-Translate is priced at $0.034 per minute of audio processing.
GPT-Realtime-Whisper is the third model, extending OpenAI's widely adopted Whisper speech recognition technology into a streaming system. Where the original Whisper was built for post-recording transcription, this version produces live captions as speech is being spoken. The use cases include live meetings, courtroom documentation, newsroom transcription, and accessibility tools for hearing-impaired users. It is the most affordable of the three at $0.017 per minute. All three models are available now through the OpenAI API and the developer playground.
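The two per-minute prices are easy to compare with a short calculation. In this sketch the dictionary keys are plain labels taken from the model names in the article, not confirmed API model identifiers, and `audio_cost` is a hypothetical helper.

```python
from decimal import Decimal

# USD per minute of audio, as quoted in the article.
# Keys are labels for this sketch, not confirmed API model identifiers.
PRICE_PER_MINUTE = {
    "GPT-Realtime-Translate": Decimal("0.034"),
    "GPT-Realtime-Whisper": Decimal("0.017"),
}


def audio_cost(model: str, minutes: float) -> Decimal:
    """Estimate the USD cost of processing `minutes` of audio."""
    # Convert via str so a float such as 2.5 becomes exactly Decimal("2.5").
    return PRICE_PER_MINUTE[model] * Decimal(str(minutes))


translate_hour = audio_cost("GPT-Realtime-Translate", 60)
whisper_hour = audio_cost("GPT-Realtime-Whisper", 60)
print(translate_hour, whisper_hour)
```

At these rates an hour of live translation comes to $2.04 and an hour of streaming transcription to $1.02, so translation costs exactly twice as much per minute as transcription.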
The launch also adds MCP server support, image input capabilities, and SIP phone calling integration to the Realtime API, broadening the range of enterprise telephony and agentic workflows developers can build without leaving the API.
The AI tool space has also attracted attackers looking to exploit interest in new products. Notebookcheck reported yesterday on a fake Claude AI website that was pushing the Beagle Windows backdoor through Google-sponsored search results using a trojanized Claude-Pro Relay installer.