OpenAI is building a next-generation audio model designed to make spoken interactions with ChatGPT feel less mechanical by enabling the system to respond dynamically when a user interrupts, according to a person familiar with the work.
The company’s current Advanced Voice Mode relies on a turn-based approach: the user finishes speaking, then the model processes the audio and returns a reply. Under this setup, brief interjections from the user, such as "okay" or "mm-hm," cause the system to halt rather than continue the exchange naturally.
The new model, described internally as bidirectional or BiDi, is intended to continuously analyze a speaker’s voice so the AI can modify its response mid-sentence if the user cuts in. That capability would contrast with existing voice models, which generate fixed responses that cannot change once the model begins speaking.
Prototype behavior and readiness
The BiDi prototype has demonstrated the ability to react to interruptions, but it is not yet polished enough for public release. The person with knowledge of the project said the model tends to glitch or switch into abnormal-sounding voices after a few minutes of back-and-forth conversation. Those stability issues have delayed the rollout: researchers had hoped to ship BiDi within the first quarter of the year, but that timetable may now move into the second quarter or later.
Rationale and potential applications
OpenAI sees narrowing the performance gap between voice interfaces and text-based models as a way to broaden AI adoption. The company believes many people prefer speaking to an assistant rather than typing, and a real-time audio model could lower the friction of voice-based interactions.
One anticipated application is customer support. In a retail call scenario, for example, an agent running BiDi could handle a customer who begins a request to return an item but then decides to exchange it, pivoting smoothly with the conversation rather than stopping or becoming confused. The person familiar with the project also said the model is better at using external tools and applications than current voice models.
OpenAI has previously said it plans to enhance its audio model for a prospective AI-focused device that would rely primarily on voice interaction. The company is considering developing a smart speaker that could perform tasks such as checking emails or booking reservations through spoken commands.
Limitations and timeline uncertainty
While BiDi promises to make voice-based AI feel more conversational, the technology’s present instability after extended exchanges is a material constraint on immediate deployment. The timeline for broader availability is therefore uncertain and dependent on resolving those technical issues.
OpenAI’s belief that improved voice performance will expand global AI use rests on the assumption that more natural spoken interaction encourages adoption; the company must first resolve the prototype’s glitching and audio anomalies before that potential can be realized.