OpenAI is building a next-generation audio model designed to make spoken interactions with ChatGPT feel less mechanical by enabling the system to respond dynamically when a user interrupts, according to a person familiar with the work.
The company’s current Advanced Voice Mode relies on a turn-based approach: the user finishes speaking, then the model processes the audio and returns a reply. Under this setup, brief interjections from the user, such as "okay" or "mm-hm," cause the system to halt rather than continue the exchange naturally.
The new model, described internally as bidirectional or BiDi, is intended to continuously analyze a speaker’s voice so the AI can modify its response mid-sentence if the user cuts in. That capability would contrast with existing voice models, which generate fixed responses that cannot change once the model begins speaking.
Prototype behavior and readiness
The BiDi prototype has demonstrated the ability to react to interruptions, but it is not yet polished enough for public release. The person with knowledge of the project said the model tends to glitch or switch into abnormal-sounding voices after a few minutes of back-and-forth conversation. Those stability issues have delayed the rollout: researchers had hoped to ship BiDi within the first quarter of the year, but that timetable may now move into the second quarter or later.
Rationale and potential applications
OpenAI sees narrowing the performance gap between voice interfaces and text-based models as a way to broaden AI adoption. The company believes many people prefer speaking to an assistant rather than typing, and a real-time audio model could lower the friction of voice-based interactions.
One anticipated application is customer support. In a retail call scenario, for example, an agent running BiDi could handle a customer who begins a request to return an item but then decides to exchange it, pivoting smoothly with the conversation rather than stopping or becoming confused. The person familiar with the project also said the model is better at using external tools and applications than current voice models.
OpenAI has previously said it plans to enhance its audio model for a prospective AI-focused device that would rely primarily on voice interaction. The company is considering developing a smart speaker that could perform tasks such as checking emails or booking reservations through spoken commands.
Limitations and timeline uncertainty
While BiDi promises to make voice-based AI feel more conversational, the technology’s present instability after extended exchanges is a material constraint on immediate deployment. The timeline for broader availability is therefore uncertain and dependent on resolving those technical issues.
OpenAI’s belief that improved voice performance will expand global AI use rests on the assumption that more natural spoken interaction encourages adoption; the company must first resolve the prototype’s glitching and audio anomalies before that potential can be realized.