One of the projects I worked on is SeaVoice, an AI voice agent platform that automates business phone calls with conversational AI, replacing traditional IVR menus with natural voice interactions.
From a technical perspective, the system is built as a real-time conversational pipeline. When a user calls the system, the audio stream first goes through a streaming Automatic Speech Recognition (ASR) service that converts the caller’s speech into text. That transcript is then processed by a conversational AI layer powered by a large language model.
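The front half of that pipeline can be sketched as below. This is a minimal illustration, not the production code: `streaming_asr` and `llm_reply` are stand-in stubs for the real ASR service and LLM call, and the function names are my own.

```python
# Sketch of the call pipeline: audio chunks -> streaming ASR -> LLM reply.
# Both stages here are hypothetical stubs standing in for real services.
from typing import Iterable, List


def streaming_asr(audio_chunks: Iterable[bytes]) -> List[str]:
    """Stub ASR: emit one partial transcript per incoming audio chunk."""
    return [chunk.decode("utf-8") for chunk in audio_chunks]


def llm_reply(transcript: str) -> str:
    """Stub conversational layer: would call a hosted LLM in production."""
    return f"Agent response to: {transcript!r}"


def handle_call(audio_chunks: Iterable[bytes]) -> str:
    partials = streaming_asr(audio_chunks)
    transcript = " ".join(partials)  # assemble the final transcript from partials
    return llm_reply(transcript)


print(handle_call([b"what are", b"your hours"]))
```

In the real system each stage runs on a live audio stream rather than a finished list of chunks, but the data flow is the same.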
To improve response accuracy and reduce hallucinations, we implemented a Retrieval-Augmented Generation architecture. The system retrieves relevant information from a vector-based knowledge base and injects that context into the LLM prompt before generating a response. This allows the AI agent to provide accurate answers related to company services, FAQs, or product data.
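The retrieval step works roughly as follows. The snippet below is a toy sketch under simplifying assumptions: a two-entry in-memory store with hand-written 3-dimensional embeddings instead of a real vector database and embedding model, and a hypothetical `build_prompt` helper showing how retrieved context is injected before generation.

```python
# Toy RAG retrieval: rank documents by cosine similarity to the query
# embedding, then inject the top hits into the LLM prompt as context.
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Hand-written stand-in embeddings; a real system would use an embedding model.
KNOWLEDGE_BASE: Dict[str, List[float]] = {
    "We are open 9am-5pm on weekdays.": [1.0, 0.0, 0.2],
    "Shipping takes 3-5 business days.": [0.0, 1.0, 0.1],
}


def retrieve(query_vec: List[float], k: int = 1) -> List[str]:
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: cosine(query_vec, KNOWLEDGE_BASE[doc]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, query_vec: List[float]) -> str:
    context = "\n".join(retrieve(query_vec))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


print(build_prompt("When are you open?", [0.9, 0.1, 0.0]))
```

The design point is that the LLM only sees retrieved facts as grounded context, which is what keeps answers about services, FAQs, or product data tied to the knowledge base rather than to the model's own guesses.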
Once the response is generated, it is passed to a neural Text-to-Speech engine, which converts the text back into natural-sounding audio that is streamed to the caller. The system also includes a dialogue management layer that maintains conversation state so the AI can handle multi-turn interactions and follow-up questions.
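The dialogue-management idea can be sketched with a small state object. This is an illustrative simplification, not the actual implementation: it keeps the raw turn history and replays it into the prompt, which is the minimal mechanism that lets follow-up questions resolve against earlier turns.

```python
# Minimal dialogue state: record each turn, replay history into the prompt
# so the model can resolve follow-ups like "How long does that take?".
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DialogueState:
    history: List[Tuple[str, str]] = field(default_factory=list)  # (role, text)

    def add_user(self, text: str) -> None:
        self.history.append(("user", text))

    def add_agent(self, text: str) -> None:
        self.history.append(("agent", text))

    def as_prompt(self) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.history)


state = DialogueState()
state.add_user("Do you ship internationally?")
state.add_agent("Yes, to most countries.")
state.add_user("How long does that take?")  # "that" is resolvable from history
print(state.as_prompt())
```

A production dialogue manager would also track slots, trim or summarize long histories, and handle barge-in, but the core contract is the same: every generation sees the conversation so far.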
One of the biggest challenges we faced was reducing end-to-end latency in the voice interaction pipeline, since ASR processing, LLM inference, and TTS generation each add delay. To address this, we implemented streaming transcription, asynchronous model inference, and incremental TTS playback, so the system could start responding before the full pipeline finished.
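The incremental-playback idea can be sketched with `asyncio`. In this hedged example the LLM token stream is faked and "playing audio" is just appending to a list, but it shows the latency trick: flush TTS at sentence boundaries so the first sentence is spoken while later tokens are still being generated.

```python
# Incremental TTS sketch: consume a streaming LLM reply token by token and
# flush audio at sentence boundaries instead of waiting for the full reply.
import asyncio
from typing import AsyncIterator, List


async def llm_stream(prompt: str) -> AsyncIterator[str]:
    """Stub token stream; a real LLM would stream these over the network."""
    for token in ["Our", " hours", " are", " 9", " to", " 5.", " Anything", " else?"]:
        await asyncio.sleep(0)  # yield control, as real network I/O would
        yield token


async def incremental_tts(tokens: AsyncIterator[str]) -> List[str]:
    played: List[str] = []
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            played.append(buffer.strip())  # stand-in for synthesize + stream audio
            buffer = ""
    if buffer.strip():
        played.append(buffer.strip())  # flush any trailing partial sentence
    return played


chunks = asyncio.run(incremental_tts(llm_stream("caller question")))
print(chunks)
```

With this shape, the caller hears the first sentence as soon as it is complete, which is where most of the perceived latency win comes from.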
Overall, this project involved integrating multiple AI components—including speech recognition, large language models, vector search, and neural speech synthesis—into a scalable architecture capable of supporting real-time voice conversations.