The rapid advancement of the "Voice Stack" is driving new voice-based applications, thanks to foundation models trained for audio processing. OpenAI's Realtime API makes it easy to build voice-in, voice-out experiences, which is useful for prototyping and casual interactions. However, controlling these models' output remains harder than in text-based systems, which benefit from guardrails and agentic reasoning workflows.
To ensure accurate responses, teams often use a more structured pipeline: speech-to-text (STT) → LLM processing → text-to-speech (TTS). While this approach improves control, it introduces latency, a major concern for voice applications. To mitigate this, developers have adopted a "pre-response" technique: a quick initial reply that acknowledges the user while the more thoughtful response is being generated.
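To make the timing concrete, here is a minimal asyncio sketch of that pattern. All of the service functions (`speech_to_text`, `generate_response`, `text_to_speech`) are hypothetical stubs with simulated latencies, not real API calls; the point is only to show the pre-response being spoken while the slower LLM call runs concurrently.

```python
import asyncio

# Hypothetical stubs standing in for real STT, LLM, and TTS services.
# Each sleep simulates network/inference latency.

async def speech_to_text(audio: bytes) -> str:
    await asyncio.sleep(0.2)  # simulated STT latency
    return "What's the weather like in Paris today?"

async def generate_response(prompt: str) -> str:
    await asyncio.sleep(1.5)  # simulated LLM latency (the slow step)
    return "Paris is sunny today, around 18°C with a light breeze."

async def text_to_speech(text: str) -> None:
    await asyncio.sleep(0.1)  # simulated TTS latency
    print(f"[speaking] {text}")

async def handle_turn(audio: bytes) -> None:
    transcript = await speech_to_text(audio)

    # Kick off the full (slow) LLM response immediately...
    full_reply = asyncio.create_task(generate_response(transcript))

    # ...and speak a quick acknowledgment while it runs, so the
    # user hears something within a few hundred milliseconds.
    await text_to_speech("Hmm, let me check that.")

    # Speak the substantive answer once it is ready.
    await text_to_speech(await full_reply)

asyncio.run(handle_turn(b"<audio bytes>"))
```

In this sketch the pre-response hides most of the LLM's generation time: the user hears the acknowledgment almost immediately, and the full answer follows as soon as it completes.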
This approach has helped reduce latency to roughly 0.5–1 second, approaching the pace of natural human conversation. Prototyping voice applications is now easier than ever, and developers are encouraged to experiment with these tools. Future work will continue to refine voice application technology and to share best practices.