GeminiLive S2S + pipecat-flows Integration Issue
Hey everyone! I'm trying to integrate GeminiLive S2S (speech-to-speech) with pipecat-flows for a healthcare booking agent.

**The Problem:** When pipecat-flows transitions between nodes, it sends `LLMSetToolsFrame` to update the available tools. GeminiLive requires a WebSocket reconnection when tools change (an API limitation). After reconnection, the conversation state breaks and Gemini doesn't follow the new node's task messages to call functions.

**What works:**
- OpenAI LLM + Azure STT + ElevenLabs TTS with pipecat-flows ✅
- Tool updates happen seamlessly, no reconnection needed

**What doesn't work:**
- GeminiLive S2S + pipecat-flows ❌
- Every node transition → reconnection → broken flow

**Current workaround attempts:**
- Monkey-patched `process_frame` to handle `LLMSetToolsFrame`
- Wait for session ready after reconnection
- Trigger inference with the new context messages
- Still inconsistent behavior

**Questions:**
1. Has anyone successfully used GeminiLive with pipecat-flows?
2. Is there a recommended pattern for handling tool updates without reconnection?
3. Should I create a custom adapter that pre-registers all tools at connection time?

Any guidance appreciated! 🙏
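On question 3, one way to sketch the pre-registration idea: collect the union of every node's tool schemas up front and register them once at connection time, so node transitions never change the tool set (each node's task message then steers which tools the model should actually call). This is a minimal sketch under assumptions; the flow-config shape and `collect_all_tools` helper below are hypothetical, not pipecat-flows API:

```python
def collect_all_tools(flow_config: dict) -> list[dict]:
    """Merge the tool schemas from every node in a flow config into one
    deduplicated list, so they can all be registered at connect time.

    Assumes a hypothetical config shape: {"nodes": {name: {"tools": [...]}}}
    where each tool schema carries a unique "name" key.
    """
    seen: dict[str, dict] = {}
    for node in flow_config.get("nodes", {}).values():
        for tool in node.get("tools", []):
            # First definition wins; duplicates by name register only once.
            seen.setdefault(tool["name"], tool)
    return list(seen.values())


# Hypothetical two-node flow sharing one tool.
flow = {
    "nodes": {
        "greeting": {"tools": [{"name": "collect_name"}]},
        "booking": {"tools": [{"name": "collect_name"}, {"name": "book_slot"}]},
    }
}
all_tools = collect_all_tools(flow)  # registered once, at session start
```

The trade-off is a larger tool list in every turn's context, but it avoids the mid-call reconnection entirely.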
SOLVED: Deepgram Nova-3 (Polish) Fragmenting Phone Numbers despite `utterance_end_ms`
Hi everyone, I'm building a specialized voice assistant using **Pipecat Flows v0.0.22** and running into a frustrating issue with phone number collection that I can't seem to solve.

### The Stack
- **Framework:** Pipecat Flows v0.0.22 (Python)
- **STT:** Deepgram Nova-3 (Polish `pl`)
- **TTS:** Cartesia (Polish voice)
- **Transport:** Local WebRTC (browser-based, no telephony yet)

### The Problem
When I dictate a 9-digit Polish phone number (e.g., "690807057"), the assistant receives partial fragments and processes them individually instead of waiting for the full number. For example, if I say "690... 807... 055" (with natural pauses), the bot splits it into:
1. "6" → sent to LLM → LLM complains "Received only 1 digit"
2. "980" → sent to LLM → LLM complains
3. "5" ... and so on.

### What I Have Tried
I've gone through the documentation and tried several fixes, but the fragmentation issue persists.

1. **Deepgram Configuration (Current Setup):** I've configured the `LiveOptions` to handle phone numbers and utterance endings explicitly:

```python
options = LiveOptions(
    model="nova-3",
    language="pl",
    smart_format=True,      # Enabled
    numerals=True,          # Enabled
    utterance_end_ms=1000,  # Set to 1000 ms to force waiting
    interim_results=True,   # Required for utterance_end_ms
)
```

*Result:* Even with `utterance_end_ms=1000`, Deepgram seems to finalize the results too early during the digit pauses.

2. **VAD Tuning:**
- I tried increasing Pipecat's VAD `stop_secs` to `2.0s`.
- *Result:* This caused massive latency (a 2 s delay on every response) and didn't solve the STT fragmentation (Deepgram still finalized early). I've reverted to `0.5s` (and `0.2s` for barge-in), since `stop_secs=2.0s` is considered an anti-pattern for conversational flows.

3. **Prompt Engineering (Aggressive):**
- I instructed the LLM to "call the function IMMEDIATELY with whatever fragments you have".
- *Result:* This led to early failures where the LLM would call `capture_phone("6")`, which would fail validation (requires 9 digits), causing the bot to reject the input before the user finished speaking.
Pipecat vs LiveKit
I'm just curious what platforms you're building on, and the pros and cons of each.
Special Welcome!
A special welcome to @Kwindla Kramer, CEO of Daily (the team behind Pipecat)! I'm a big fan of his work and so glad to see him join this community. Make sure to follow him on LinkedIn!
Experts Advice Needed on my Pipecat Architecture
**Healthcare Voice Agent Architecture Review**

Hi everyone, I'm running a production voice agent (~500-600 calls/day) with **pipecat-flows**. Would appreciate feedback on my architecture.

**Why Self-Hosted:** Tried Pipecat Cloud, but Talkdesk is not supported. WebSocket is mandatory; cannot use WebRTC.

**Architecture:**
Talkdesk ──WS──► Bridge Server (Azure App Service) ──WS──► Pipecat Agent (Azure VM + Docker)
• Bridge converts μ-law 8kHz ↔ PCM 16kHz (resampling on every chunk)
• 3 Docker containers behind an Nginx load balancer; each handles ~15 concurrent calls
• Each container: 3GB RAM, 0.75 CPU limit
• CI/CD: GitHub Actions → Docker Hub → Azure VM pull

**AI Stack:**
• STT: Azure Speech (Italian)
• LLM: OpenAI GPT-4.1
• TTS: ElevenLabs (eleven_multilingual_v2)
• VAD: Silero

**Multi-Agent Setup (pipecat-flows):**
Router Node → detects intent → routes to:
• Booking Agent (20+ step flow)
• Info Agent (RAG/knowledge base)
• [Future] Doctor Booking Agent: the caller names a specific doctor, e.g. "I want to book an appointment with Dr. John for a heart checkup."
Agents can transfer between each other during the conversation.

**My Questions:**
1. **Latency** feels high. Is the two-hop WebSocket architecture (Talkdesk → Bridge → Pipecat) causing this? Should I merge the bridge into the Pipecat container?
2. Is having a **separate bridge** for audio conversion a common pattern, or is there a better approach?
3. **Routing pattern question:** I use a Router node to detect intent and route to agents, but I'm concerned this approach is too rigid. Example: currently I route to the Booking Agent when the user says "book X-ray". But what if the user says "book with Dr. John" or "book with Dr. John at 3pm tomorrow"? Should I create separate agents for each variation? That feels wrong: they're all booking, just with different pre-filled data. Or should the Router extract entities (doctor name, time, service) and pass them as parameters to a single flexible agent that skips steps dynamically?

What's the best pattern in pipecat-flows for handling these variations without creating rigid, bounded flows for each request type?
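On question 3, the entity-extraction variant can be sketched as: the Router fills whatever slots it can (service, doctor, time), and a single booking flow skips any step whose slot is already filled. A minimal sketch under assumptions; the step names, slot names, and `remaining_steps` helper below are hypothetical, not pipecat-flows API:

```python
# Hypothetical linear booking flow and the slot that satisfies each step.
BOOKING_STEPS = ["ask_service", "ask_doctor", "ask_time", "confirm"]
STEP_SLOT = {"ask_service": "service", "ask_doctor": "doctor", "ask_time": "time"}


def remaining_steps(slots: dict[str, str]) -> list[str]:
    """Return the booking steps still needed, given slots the Router
    already extracted. 'confirm' has no slot, so it is always kept and
    the agent reads the collected details back to the caller."""
    return [step for step in BOOKING_STEPS if STEP_SLOT.get(step) not in slots]


# "book with Dr. John at 3pm tomorrow" -> Router extracted doctor + time,
# so the flow only needs to ask for the service, then confirm.
slots = {"doctor": "Dr. John", "time": "3pm tomorrow"}
assert remaining_steps(slots) == ["ask_service", "confirm"]
```

The appeal of this shape is that one flow covers "book X-ray", "book with Dr. John", and "book with Dr. John at 3pm tomorrow" without a separate agent per variation.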
Open Source Voice AI Community
skool.com/open-source-voice-ai-community-6088
Voice AI made open: Learn to build voice agents with Livekit & Pipecat and uncover what the closed platforms are hiding.