Hi everyone,
I'm building a specialized voice assistant using **Pipecat Flows v0.0.22** and running into a frustrating issue with phone number collection that I can't seem to solve.
### The Stack
- **Framework:** Pipecat Flows v0.0.22 (Python)
- **STT:** Deepgram Nova-3 (Polish `pl`)
- **TTS:** Cartesia (Polish voice)
- **Transport:** Local WebRTC (browser-based, no telephony yet)
### The Problem
When I dictate a 9-digit Polish phone number (e.g., "690807057"), the assistant receives partial fragments and processes them individually instead of waiting for the full number.
For example, if I say "690... 807... 057" (with natural pauses), the bot splits it into:
1. "6" -> sent to LLM -> LLM complains "Received only 1 digit"
2. "980" -> sent to LLM -> LLM complains
3. "5" ... and so on.
### What I Have Tried
I've gone through the documentation and tried several fixes, but the fragmentation issue persists.
1. **Deepgram Configuration (Current Setup):**
I've configured the `LiveOptions` to handle phone numbers and utterance endings explicitly:
```python
options = LiveOptions(
    model="nova-3",
    language="pl",
    smart_format=True,       # Enabled
    numerals=True,           # Enabled
    utterance_end_ms=1000,   # Set to 1000ms to force waiting
    interim_results=True,    # Required for utterance_end_ms
)
```
*Result:* Even with `utterance_end_ms=1000`, Deepgram seems to finalize the results too early during the digit pauses.
2. **VAD Tuning:**
- I tried increasing Pipecat's VAD `stop_secs` to `2.0s`.
- *Result:* This caused massive latency (a 2s delay on every response) and didn't solve the underlying STT fragmentation (Deepgram still finalized early). I've reverted to `0.5s` (and `0.2s` for barge-in), since `stop_secs=2.0s` is considered an anti-pattern for conversational flows.
3. **Prompt Engineering (Aggressive):**
- I instructed the LLM to "call the function IMMEDIATELY with whatever fragments you have".
- *Result:* This led to early failures where the LLM would call `capture_phone("6")`, which would fail validation (requires 9 digits), causing the bot to reject the input before the user finished speaking.
- I also tried the opposite: "Wait for the full number". But since Deepgram sends final frames individually (e.g., "6", then "9..."), the Flow system triggers the node for each fragment as a separate turn, so the LLM never sees the full accumulated context.
4. **Slow Explicit Dictation Test:**
- Today I tried being very explicit, separating each digit with significant pauses and speaking very slowly.
- *Result:* It **did** manage to record the full phone number correctly when I was extremely slow and deliberate. However, this requires an unnatural speaking pace that won't work for all users.
### Log Snippet
Here is what I see in the logs. Note the multiple sanitized inputs for a single attempt:
```
INFO | app.processors.user_input_sanitization | ✅ Sanitized text input: '6'
INFO | app.processors.llm_response_filter | 🤖 Clean LLM response: 'Otrzymałem tylko 1 cyfrę...'
...
INFO | app.processors.user_input_sanitization | ✅ Sanitized text input: '980'
...
INFO | app.processors.user_input_sanitization | ✅ Sanitized text input: '5,'
```
### Request for Help
Has anyone successfully implemented reliable phone number collection with Deepgram Nova-3 in Pipecat Flows?
- Is there a specific Deepgram parameter I'm missing for Polish?
- Should I be handling this "accumulation" logic manually in a custom Processor or within the Flow state?
- What is the best way to collect phone numbers while accommodating different speaking paces? Has anyone else hit the point where users must speak unnaturally slowly for it to work?
Any advice would be appreciated!