Healthcare Voice Agent Architecture Review
Hi everyone,
Running a production voice agent (~500-600 calls/day) with pipecat-flows. Would appreciate feedback on my architecture.
Why Self-Hosted: Tried Pipecat Cloud but Talkdesk is not supported. WebSocket is mandatory - cannot use WebRTC.
Architecture:
Talkdesk ──WS──► Bridge Server (Azure App Service) ──WS──► Pipecat Agent (Azure VM + Docker)
• Bridge converts μ-law 8kHz → PCM 16kHz (resampling on every chunk)
• 3 Docker containers behind Nginx load balancer
• Each handles ~15 concurrent calls ──► Each container: 3GB RAM, 0.75 CPU limit
• CI/CD: GitHub Actions → Docker Hub → Azure VM pull
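A minimal sketch of what the per-chunk conversion in the bridge involves, assuming 8-bit G.711 μ-law input and little-endian 16-bit PCM output. The class name `UlawTo16kPcm` is hypothetical, and the 2x upsampling here is plain linear interpolation; a production resampler would normally use a proper low-pass filter. The point it illustrates is that the converter keeps the last sample between chunks, so resampling chunk by chunk does not create a discontinuity at every chunk boundary.

```python
import struct


def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                      # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


class UlawTo16kPcm:
    """Stateful mu-law 8 kHz -> PCM16 16 kHz converter (illustrative only)."""

    def __init__(self) -> None:
        # Last decoded sample of the previous chunk, used to interpolate
        # the first output sample of the next chunk.
        self._prev = 0

    def convert(self, chunk: bytes) -> bytes:
        out = []
        for byte in chunk:
            sample = ulaw_byte_to_pcm16(byte)
            out.append((self._prev + sample) // 2)  # interpolated midpoint
            out.append(sample)                      # original sample
            self._prev = sample
        return struct.pack(f"<{len(out)}h", *out)
```

With this shape, a 20 ms frame of 160 μ-law bytes becomes 640 bytes of 16 kHz PCM, and one converter instance would be kept per call so that state is not shared between callers.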
AI Stack:
• STT: Azure Speech (Italian)
• LLM: OpenAI GPT-4.1
• TTS: ElevenLabs (eleven_multilingual_v2)
• VAD: Silero
Multi-Agent Setup (pipecat-flows):
Router Node → detects intent → routes to:
• Booking Agent (20+ step flow)
• Info Agent (RAG/knowledge base)
• [Future] Doctor Booking Agent, for when the caller names a specific doctor, e.g. "I want to book an appointment with Dr. Jhon for a heart checkup."
Agents can transfer between each other during conversation.
My Questions:
1. Latency feels high. Is the two-hop WebSocket architecture (Talkdesk → Bridge → Pipecat) causing this? Should I merge the bridge into the Pipecat container?
2. Is having a separate bridge for audio conversion a common pattern, or is there a better approach?
3. Routing pattern question: I use a Router node to detect intent and route to agents. But I'm concerned this approach is too rigid.
Example: Currently I route to "Booking Agent" when the user says "book X-ray". But what if the user says "book with Dr. Jhon" or "book with Dr. Jhon at 3pm tomorrow"?
Should I create separate agents for each variation? That feels wrong - they're all booking, just with different pre-filled data.
Or should the Router extract entities (doctor name, time, service) and pass them as parameters to a single flexible agent that skips steps dynamically?
What's the best pattern in pipecat-flows for handling these variations without creating rigid, bounded flows for each request type?
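To make the single-flexible-agent option concrete, here is a rough sketch in plain Python. It deliberately does not use the pipecat-flows API: `BOOKING_SLOTS`, `BookingState`, and the slot names are all hypothetical. The idea is that the Router hands over whatever entities it extracted, and the booking flow only asks for the first slot that is still empty.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical slots, in the order the booking flow would ask for them.
BOOKING_SLOTS = ["service", "doctor", "date", "time"]


@dataclass
class BookingState:
    slots: Dict[str, Optional[str]]

    @classmethod
    def from_router(cls, extracted: Dict[str, str]) -> "BookingState":
        # Keep only known slots; anything the router did not extract is None.
        return cls({name: extracted.get(name) for name in BOOKING_SLOTS})

    def fill(self, name: str, value: str) -> None:
        if name not in self.slots:
            raise KeyError(f"unknown slot: {name}")
        self.slots[name] = value

    def next_missing(self) -> Optional[str]:
        # The first empty slot decides which step runs next;
        # None means every slot is filled and the flow can confirm.
        for name in BOOKING_SLOTS:
            if not self.slots[name]:
                return name
        return None


# "book with Dr. Jhon at 3pm tomorrow": doctor, date and time arrive pre-filled,
# so the flow would skip straight to asking for the service.
state = BookingState.from_router(
    {"doctor": "Dr. Jhon", "date": "tomorrow", "time": "15:00"}
)
first_question = state.next_missing()
```

In this shape "book X-ray" and "book with Dr. Jhon at 3pm tomorrow" go to the same agent and differ only in which slots start out filled.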
4. What are you using for observability in production?
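Related to questions 1 and 4, this is a minimal sketch of the kind of per-turn timing that would show where the latency goes. `TurnTimer` and the stage names are hypothetical and not part of Pipecat; the clock is injectable only so the example can be checked deterministically.

```python
import time
from typing import Callable, Dict, List, Tuple


class TurnTimer:
    """Records named timestamps for one conversational turn."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._marks: List[Tuple[str, float]] = []

    def mark(self, stage: str) -> None:
        self._marks.append((stage, self._clock()))

    def deltas_ms(self) -> Dict[str, float]:
        # Elapsed milliseconds between each pair of consecutive marks.
        result = {}
        for (prev_name, prev_t), (name, t) in zip(self._marks, self._marks[1:]):
            result[f"{prev_name}->{name}"] = round((t - prev_t) * 1000, 1)
        return result


# Deterministic demo with a fake clock instead of real time.
fake_times = iter([0.0, 0.25, 0.9])
timer = TurnTimer(clock=lambda: next(fake_times))
timer.mark("user_stopped")
timer.mark("stt_final")
timer.mark("tts_first_audio")
report = timer.deltas_ms()
```

Marking the same stages in both the bridge and the Pipecat container would separate the cost of the extra WebSocket hop from STT, LLM, and TTS time.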
Any feedback appreciated. Thanks!