Karol Meguid

Hi everyone, I'm building a specialized voice assistant using **Pipecat Flows v0.0.22** and running into a frustrating issue with phone number collection that I can't seem to solve.

### The Stack

- **Framework:** Pipecat Flows v0.0.22 (Python)
- **STT:** Deepgram Nova-3 (Polish `pl`)
- **TTS:** Cartesia (Polish voice)
- **Transport:** Local WebRTC (browser-based, no telephony yet)

### The Problem

When I dictate a 9-digit Polish phone number (e.g., "690807057"), the assistant receives partial fragments and processes them individually instead of waiting for the full number.

For example, if I say "690... 807... 057" (with natural pauses), the bot splits it into:

1. "6" -> sent to LLM -> LLM complains "Received only 1 digit"
2. "980" -> sent to LLM -> LLM complains
3. "5" ... and so on.

### What I Have Tried

I've gone through the documentation and tried several fixes, but the fragmentation issue persists.

1. **Deepgram Configuration (Current Setup):** I've configured the `LiveOptions` to handle phone numbers and utterance endings explicitly:

   ```python
   from deepgram import LiveOptions

   options = LiveOptions(
       model="nova-3",
       language="pl",
       smart_format=True,      # Enabled
       numerals=True,          # Enabled
       utterance_end_ms=1000,  # Set to 1000ms to force waiting
       interim_results=True,   # Required for utterance_end_ms
   )
   ```

   *Result:* Even with `utterance_end_ms=1000`, Deepgram seems to finalize the results too early during the digit pauses.

2. **VAD Tuning:**
   - I tried increasing Pipecat's VAD `stop_secs` to `2.0s`.
   - *Result:* This caused massive latency (a 2s delay on every response) and didn't solve the actual STT fragmentation (Deepgram still finalized early). I've reverted to `0.5s` (and `0.2s` for barge-in), as `stop_secs=2.0s` is considered an anti-pattern for conversational flows.

3. **Prompt Engineering (Aggressive):**
   - I instructed the LLM to "call the function IMMEDIATELY with whatever fragments you have".
   - *Result:* This led to early failures where the LLM would call `capture_phone("6")`, which would fail validation (requires 9 digits), causing the bot to reject the input before the user finished speaking.
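
To make the behavior I'm after concrete, here is a minimal sketch in plain Python of buffering digit fragments until all 9 digits have arrived. It is not part of Pipecat or Deepgram: `DigitAccumulator` and `add_fragment` are hypothetical names, and where such a buffer would sit in the pipeline (before the LLM, or inside the `capture_phone` handler) is an open assumption.

```python
import re
from typing import Optional

PHONE_LENGTH = 9  # the validation described above requires 9 digits


class DigitAccumulator:
    """Hypothetical helper: buffers digits across finalized STT fragments."""

    def __init__(self, expected: int = PHONE_LENGTH) -> None:
        self.expected = expected
        self._digits = ""

    def add_fragment(self, transcript: str) -> Optional[str]:
        """Append the digits found in one transcript fragment.

        Returns the complete number once `expected` digits are buffered,
        otherwise None. Digits beyond `expected` are dropped and the
        buffer is reset after a complete number is returned.
        """
        # Keep digits only; spaces, dashes and words are discarded.
        self._digits += re.sub(r"\D", "", transcript)
        if len(self._digits) >= self.expected:
            number = self._digits[: self.expected]
            self._digits = ""
            return number
        return None
```

With something like this, each fragment would be fed to `add_fragment`, and the validation step would only run once it returns a non-`None` value, so a lone "6" would no longer be rejected as an incomplete number.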


Karol Meguid

0 likes • 4d

@Arek Wu I find the Cartesia model the best at the moment: it's the quickest and sounds more human than 11labs, though unfortunately there aren't many voices to choose from. You can always try 11labs TTS and see how it works for you. I think it's a little more reliable, as it's also a Polish startup, but the downside is that it's slower and doesn't sound as good as the Cartesia models for now.

Karol Meguid

0 likes • 4d

@Arek Wu Yeah, of course everything is in the system prompt: correct spelling, hours, dates, prices, and numbers. I specifically had trouble with the pronunciation of the business ID: it kept saying "NIP" in English instead of Polish no matter what. After a long battle, I finally found a way to prompt it so that it speaks correctly without any tweaks. Btw, I sent you a DM on LinkedIn and would really appreciate the feedback :)
