Hi everyone,
I'm building a specialized voice assistant using **Pipecat Flows v0.0.22** and running into a frustrating issue with phone number collection that I can't seem to solve.
### The Stack
- **Framework:** Pipecat Flows v0.0.22 (Python)
- **STT:** Deepgram Nova-3 (Polish `pl`)
- **TTS:** Cartesia (Polish voice)
- **Transport:** Local WebRTC (browser-based, no telephony yet)
### The Problem
When I dictate a 9-digit Polish phone number (e.g., "690807057"), the assistant receives partial fragments and processes them individually instead of waiting for the full number.
For example, if I say "690... 807... 057" (with natural pauses), the bot splits it into:
1. "6" -> sent to LLM -> LLM complains "Received only 1 digit"
2. "980" -> sent to LLM -> LLM complains
3. "5" ... and so on.
### What I Have Tried
I've gone through the documentation and tried several fixes, but the fragmentation issue persists.
1. **Deepgram Configuration (Current Setup):**
I've configured the `LiveOptions` to handle phone numbers and utterance endings explicitly:
```python
options = LiveOptions(
    model="nova-3",
    language="pl",
    smart_format=True,       # Enabled
    numerals=True,           # Enabled
    utterance_end_ms=1000,   # Set to 1000ms to force waiting
    interim_results=True,    # Required for utterance_end_ms
)
```
*Result:* Even with `utterance_end_ms=1000`, Deepgram seems to finalize the results too early during the digit pauses.
2. **VAD Tuning:**
- I tried increasing Pipecat's VAD `stop_secs` to `2.0s`.
- *Result:* This caused massive latency (a 2s delay on every response) and didn't solve the underlying STT fragmentation (Deepgram still finalized early). I've reverted to `0.5s` (and `0.2s` for barge-in), since `stop_secs=2.0s` is considered an anti-pattern for conversational flows.
3. **Prompt Engineering (Aggressive):**
- I instructed the LLM to "call the function IMMEDIATELY with whatever fragments you have".
- *Result:* This led to early failures where the LLM would call `capture_phone("6")`, which would fail validation (requires 9 digits), causing the bot to reject the input before the user finished speaking.
- I also tried the opposite: "Wait for the full number". But since Deepgram sends final frames individually (e.g., "6", then "9..."), the Flow system triggers the node for each fragment as a separate turn, so the LLM never sees the full accumulated context.
4. **Slow Explicit Dictation Test:**
- Today I tried being very explicit, separating each digit with significant pauses and speaking very slowly.
- *Result:* It **did** manage to record the full phone number correctly when I was extremely slow and deliberate. However, this requires an unnatural speaking pace that won't work for all users.
### Log Snippet
Here is what I see in the logs. Note the multiple sanitized inputs for a single attempt:
```
INFO | app.processors.user_input_sanitization | ✅ Sanitized text input: '6'
INFO | app.processors.llm_response_filter | 🤖 Clean LLM response: 'Otrzymałem tylko 1 cyfrę...'
...
INFO | app.processors.user_input_sanitization | ✅ Sanitized text input: '980'
...
INFO | app.processors.user_input_sanitization | ✅ Sanitized text input: '5,'
```
### Request for Help
Has anyone successfully implemented reliable phone number collection with Deepgram Nova-3 in Pipecat Flows?
- Is there a specific Deepgram parameter I'm missing for Polish?
- Should I be handling this "accumulation" logic manually in a custom Processor or within the Flow state?
- What is the best way to collect phone numbers while accommodating different speaking paces? Has anyone else hit the point where users must speak unnaturally slowly for it to work?
Any advice would be appreciated!