Some new voice models from OpenAI
Sharing: I got this email from OpenAI today about some new voice models they've released, and I've used some of their audio models in my projects...
Hello,
Thanks for using OpenAI’s audio models. Today, I’m excited to share that we have three new audio models in the API. They offer significant improvements in speech-to-text and text-to-speech capabilities, making it possible to build more powerful, customizable, and intelligent voice agents that can act as true conversational partners. We’re also updating our Agents SDK to support the new models, making it possible to convert any text-based agent into an audio agent with a few lines of code.
Speech-to-text
You can now use `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` in use cases ranging from customer service voice agents to transcribing meeting notes. We’ve added bidirectional streaming so you can stream audio in, and get a stream of text back. And the streaming API supports built-in noise cancellation and a new semantic voice activity detector so you can opt for transcriptions only when the user has finished their thought (useful for building voice agents!). These models outperform Whisper, offering better accuracy and performance. For more, check out our docs.
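With the standard `openai` Python SDK, a minimal transcription call looks roughly like the sketch below. The file name `meeting.wav` and the parameter choices are illustrative assumptions on my part, not from the email.

```python
"""Sketch: transcribing audio with the new gpt-4o-transcribe model via the
official openai Python SDK. File name and parameters are illustrative."""
import os


def transcription_params(model: str = "gpt-4o-transcribe") -> dict:
    # Keyword arguments for client.audio.transcriptions.create();
    # response_format="text" asks for the plain transcript string.
    return {"model": model, "response_format": "text"}


if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires: pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open("meeting.wav", "rb") as audio:  # hypothetical input file
        transcript = client.audio.transcriptions.create(
            file=audio, **transcription_params()
        )
    print(transcript)
```

The bidirectional streaming the email mentions is exposed separately (e.g. streaming parameters on the same endpoint and the realtime transcription APIs); check the docs for the exact shape rather than treating this sketch as the streaming path.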
Text-to-speech
With the new `gpt-4o-mini-tts` model, you can precisely control the tone, emotion, and speed of generated voices, creating more natural and engaging experiences. Starting with 10 preset voices, you can use prompts to customize speech for specific scenarios. This enables a wide range of use cases, from more empathetic and dynamic customer service voices to expressive narration for creative storytelling experiences. We’ve also built OpenAI.fm, a demo where you can try our new TTS model under our beta terms. You can read the docs to get started.
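As a rough sketch of steering tone with the new model: the voice name, the instructions prompt, and the output path below are my own illustrative choices, not taken from the email.

```python
"""Sketch: generating steerable speech with gpt-4o-mini-tts. The voice,
instructions prompt, and output path here are illustrative assumptions."""
import os


def tts_params(text: str, voice: str = "coral") -> dict:
    # Keyword arguments for client.audio.speech.create(); the
    # `instructions` field is what controls tone, emotion, and pacing.
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": "Speak in a calm, empathetic customer-service tone.",
    }


if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires: pip install openai

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        **tts_params("Thanks for holding. I can help with that refund.")
    ) as response:
        response.stream_to_file("reply.mp3")  # hypothetical output path
```

Changing only the `instructions` string (not the input text) is what shifts the delivery, which is why the same line can sound like support-desk empathy or storybook narration.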
Agents SDK updates
You can now add audio capabilities to text agents by adding speech-to-text and text-to-speech endcaps with just a few lines of code. To get started, visit the Agents SDK docs.
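A sketch of what that conversion can look like with the Python `openai-agents` package's voice support: the agent name and instructions are mine, and the class names assume the `openai-agents[voice]` extra, so verify them against the current SDK docs.

```python
"""Sketch: turning a text agent into a voice agent with the openai-agents
package's voice pipeline. The agent name/instructions are illustrative,
and the class names assume the openai-agents[voice] extra."""
import os


def agent_instructions() -> str:
    # The same system prompt drives the agent whether it is text or voice.
    return "You are a helpful assistant. Keep spoken replies brief."


if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    import asyncio

    import numpy as np
    from agents import Agent
    from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

    agent = Agent(name="Assistant", instructions=agent_instructions())
    # Speech-to-text in front, the text agent in the middle, TTS on the way out.
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

    async def main() -> None:
        audio = AudioInput(buffer=np.zeros(24000 * 2, dtype=np.int16))  # 2 s of silence
        result = await pipeline.run(audio)
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                ...  # play or buffer event.data (PCM audio chunks)

    asyncio.run(main())
```

The point of the pipeline design is that the `Agent` itself is unchanged; only the input/output stages differ between the text and voice versions.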
With these new models, we want to give you more choices in how you build voice agents. If you already have a text-based agent, or a voice agent powered by a speech-to-text and text-to-speech pipeline, using the new models with the Agents SDK is the best way to get started. If you’re looking to build low-latency speech-to-speech experiences, we recommend building with our speech-to-speech models in the Realtime API. You can read more about these new models in our blog, and if you have any questions, please feel free to join our developer forum.
Best,
Jeff Harris
OpenAI API TPM