11 contributions to Open Source Voice AI Community
New NVIDIA open model for voice agents: Nemotron Speech ASR
NVIDIA released a new open source speech-to-text model designed from the ground up for low-latency use cases like voice agents. This is part of NVIDIA's new focus on open models, which I'm excited about. The Nemotron family includes STT and TTS models, specialized models such as guardrail models, and LLMs. They are completely open: open weights, training code, training datasets, and inference tooling.

This new STT model is very fast. Here's a voice agent running locally on my RTX 5090 with sub-500ms voice-to-voice inference.

Technical write-up and link to the GitHub repo: https://www.daily.co/blog/building-voice-agents-with-nvidia-open-models/

Also on Twitter and LinkedIn, if either of those platforms is your thing. (I post a lot about voice agents on both.)
https://x.com/kwindla/status/2008601714392514722
https://www.linkedin.com/posts/kwkramer_nvidia-just-released-a-new-open-source-transcription-activity-7414368349905821696-ufuy/
1 like • 4d
Yeah, I agree that we don't actually need to get all the way down to 500ms for real-world voice AI use cases. It's cool to be able to, though!
GeminiLive S2S + pipecat-flows Integration Issue
Hey everyone! I'm trying to integrate GeminiLive S2S (speech-to-speech) with pipecat-flows for a healthcare booking agent.

The Problem: When pipecat-flows transitions between nodes, it sends LLMSetToolsFrame to update available tools. GeminiLive requires WebSocket reconnection when tools change (API limitation). After reconnection, the conversation state breaks and Gemini doesn't follow the new node's task messages to call functions.

What works:
- OpenAI LLM + Azure STT + ElevenLabs TTS with pipecat-flows ✅
- Tool updates happen seamlessly, no reconnection needed

What doesn't work:
- GeminiLive S2S + pipecat-flows ❌
- Every node transition → reconnection → broken flow

Current workaround attempts:
- Monkey-patched process_frame to handle LLMSetToolsFrame
- Wait for session ready after reconnection
- Trigger inference with new context messages
- Still inconsistent behavior

Questions:
1. Has anyone successfully used GeminiLive with pipecat-flows?
2. Is there a recommended pattern for handling tool updates without reconnection?
3. Should I create a custom adapter that pre-registers all tools at connection time?

Any guidance appreciated! 🙏
1 like • 4d
The speech-to-speech models are not currently supported by Pipecat Flows, because their APIs do not fully support the context engineering that Flows does. The gpt-realtime API has added capabilities that should be enough to support Flows, but nobody has done the engineering work for that, yet. The Gemini Live APIs are still not mature enough to support complex workflows at all, whether built on Flows or not. For something like a booking agent, I still recommend pretty strongly using the STT->LLM->TTS approach. You will have better reliability, better observability, and the latency is the same. Gemini Live voice-to-voice time is about 2.5s these days. A three-model pipeline should be <1.5s.
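For reference, here is roughly what that three-model (STT → LLM → TTS) pipeline looks like in Pipecat, matching the stack mentioned in the question (Azure STT, OpenAI, ElevenLabs). This is a minimal sketch based on the foundational examples, not the poster's actual code: the import paths, service constructor arguments, and the `run_bot(transport)` entry point are assumptions that may differ between Pipecat releases, so check them against the version you have pinned.

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.azure.stt import AzureSTTService
from pipecat.services.elevenlabs.tts import ElevenLabsTTSService
from pipecat.services.openai.llm import OpenAILLMService


async def run_bot(transport):
    # `transport` is whatever transport you already use (Daily WebRTC, a
    # websocket transport for telephony, etc.); it is created elsewhere.
    stt = AzureSTTService(
        api_key=os.getenv("AZURE_SPEECH_API_KEY"),
        region=os.getenv("AZURE_SPEECH_REGION"),
    )
    llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4.1")
    tts = ElevenLabsTTSService(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
    )

    # Conversation context. With pipecat-flows, the flow manager owns the
    # system/task messages instead of a static list like this one.
    context = OpenAILLMContext(
        [{"role": "system", "content": "You are a helpful booking assistant."}]
    )
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline(
        [
            transport.input(),               # audio in from the caller
            stt,                             # speech -> text
            context_aggregator.user(),       # add user turns to the context
            llm,                             # text -> text
            tts,                             # text -> speech
            transport.output(),              # audio out to the caller
            context_aggregator.assistant(),  # add bot turns to the context
        ]
    )

    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    await PipelineRunner().run(task)
```

The point of the shape is that each stage is observable on its own, which is where the reliability and debuggability advantage over a speech-to-speech model comes from.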
Musings about Vibe Coding, Pipecat, LiveKit and more
So, over the past few weeks I've been neck deep in working with Pipecat, LiveKit, and vibe coding. Mainly, I wanted to see what kind of mileage I could get from vibe coding tools, and to test that, what better way than to build a Pipecat/LiveKit implementation?

So, I decided to examine 3 primary tools:
- Claude Code - using Sonnet 3.5 (via the CLI)
- OpenCode - Grok Code Fast 1
- Google Antigravity - using Gemini 2.5

Below are my conclusions, split into several categories.

💵 Financials:
Most expensive to use - Claude Code
Least expensive to use - OpenCode

😡 Developer Experience:
Best experience - Google Antigravity
Worst experience - Claude Code

💪 Reliability:
Most reliable - Claude Code
Least reliable - OpenCode

🚅 Performance:
Fastest planning and building - Google Antigravity
Slowest planning and building - OpenCode

So, overall, there is no "one tool to rule them all" here; what I found is that each tool is really good at specific tasks. Here is what I've learned about how to leverage these tools in order to build something successful:
- Planning can be performed with either OpenCode or Google Antigravity. Google provides free developer credits for Antigravity, and their deep-thinking and reasoning engine works very well when applied to software architecture and design.
- Backend development with either Claude Code or Google Antigravity. When coupled with proper topic sub-agents, these are really powerful tools. For some odd reason, Claude Code is far more capable at handling complex architectures, while Google Antigravity leans towards "hacker style" coding.
- UI/UX development - without any question, OpenCode did a better job. It was far more capable of spitting out hundreds of lines of working UI/UX code, even faster than Claude. However, if at some point it gets stuck on a specific UI component package, it may require Claude to show it the light, so pay attention to what it's doing.
- Code Review, Security, and Privacy - without any question, Claude is the winner here, with potentially the most extensive availability of sub-agent topic experts.
2 likes • 13d
I try to use as many AI coding environments as I can, but Claude Code is the one I keep coming back to. The integration of the model and the harness is just so good, and I personally like working in the terminal.

One really encouraging thing is that all of these environments have gotten much better at writing correct Pipecat code. Six months ago, none of them could handle a library as big as Pipecat very well. We did a lot of work on the docs and core repo organization, which has definitely helped. But I think the biggest thing is just that the models and harnesses keep getting better.

One thing I do, though, when writing complex Pipecat code, is clone the GitHub repo locally, check out the release I'm using (`git checkout v0.0.98`), and then sometimes tell Claude Code: look in the pipecat/ core implementation code to see how the Foo class works, including which frames are handled and how code in pipecat/examples/foundational uses the class.
Game Changer - New Potential Client - Need Assistance!!!
I have a meeting today with a potential client. He's the Director, PMO for a private detention and correctional conglomerate. They have educational re-entry programs, a transportation operations division, real estate, etc. I want to bring Voice AI tools into their operations. I just want to start with a small project to collaborate with him and prove what I can do. What would be a good introduction statement? What kind of demo can I do? (Examples) Ultimately, what price do I charge? Your thoughts are much appreciated.
3 likes • 28d
You can configure a demo voice agent using the Pipecat CLI in ~5 minutes, test it locally, and then deploy it for a demo: https://docs.pipecat.ai/cli/overview The CLI and everything else about Pipecat are completely open source.
Experts Advice Needed on my Pipecat Architecture
๐—›๐—ฒ๐—ฎ๐—น๐˜๐—ต๐—ฐ๐—ฎ๐—ฟ๐—ฒ ๐—ฉ๐—ผ๐—ถ๐—ฐ๐—ฒ ๐—”๐—ด๐—ฒ๐—ป๐˜ ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฅ๐—ฒ๐˜ƒ๐—ถ๐—ฒ๐˜„ Hi everyone, Running a production voice agent (~500-600 calls/day) with ๐—ฝ๐—ถ๐—ฝ๐—ฒ๐—ฐ๐—ฎ๐˜-๐—ณ๐—น๐—ผ๐˜„๐˜€. Would appreciate feedback on my architecture. ๐—ช๐—ต๐˜† ๐—ฆ๐—ฒ๐—น๐—ณ-๐—›๐—ผ๐˜€๐˜๐—ฒ๐—ฑ: Tried Pipecat Cloud but Talkdesk is not supported. WebSocket is mandatory - cannot use WebRTC. ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ: Talkdesk โ”€โ”€WSโ”€โ”€โ–บ Bridge Server (Azure App Service) โ”€โ”€WSโ”€โ”€โ–บ Pipecat Agent (Azure VM + Docker) โ€ข Bridge converts ฮผ-law 8kHz โ†” PCM 16kHz (resampling on every chunk) โ€ข 3 Docker containers behind Nginx load balancer โ€ข Each handles ~15 concurrent calls โ”€โ”€โ–บ Each container: 3GB RAM, 0.75 CPU limit โ€ข CI/CD: GitHub Actions โ†’ Docker Hub โ†’ Azure VM pull ๐—”๐—œ ๐—ฆ๐˜๐—ฎ๐—ฐ๐—ธ: โ€ข STT: Azure Speech (Italian) โ€ข LLM: OpenAI GPT-4.1 โ€ข TTS: ElevenLabs (eleven_multilingual_v2) โ€ข VAD: Silero ๐— ๐˜‚๐—น๐˜๐—ถ-๐—”๐—ด๐—ฒ๐—ป๐˜ ๐—ฆ๐—ฒ๐˜๐˜‚๐—ฝ (pipecat-flows): Router Node โ†’ detects intent โ†’ routes to: โ€ข Booking Agent (20+ step flow) โ€ข Info Agent (RAG/knowledge base) โ€ข [Future] Person specify the doctors name e.g "I want to book appointment with Dr. Jhon for heart checkup." Doctor Booking Agent Agents can transfer between each other during conversation. ๐— ๐˜† ๐—ค๐˜‚๐—ฒ๐˜€๐˜๐—ถ๐—ผ๐—ป๐˜€: ๐Ÿญ. ๐—Ÿ๐—ฎ๐˜๐—ฒ๐—ป๐—ฐ๐˜† feels high. Is the two-hop WebSocket architecture (Talkdesk โ†’ Bridge โ†’ Pipecat) causing this? Should I merge the bridge into the Pipecat container? ๐Ÿฎ. Is having a ๐˜€๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐˜๐—ฒ ๐—ฏ๐—ฟ๐—ถ๐—ฑ๐—ด๐—ฒ for audio conversion a common pattern, or is there a better approach? ๐Ÿฏ. ๐—ฅ๐—ผ๐˜‚๐˜๐—ถ๐—ป๐—ด ๐—ฝ๐—ฎ๐˜๐˜๐—ฒ๐—ฟ๐—ป ๐—พ๐˜‚๐—ฒ๐˜€๐˜๐—ถ๐—ผ๐—ป: I use a Router node to detect intent and route to agents. But I'm concerned this approach is too rigid. Example: Currently I route to "Booking Agent" when user says "book X-ray". But what if user says "book with Dr. Jhon" or "book with Dr. Jhon at 3pm tomorrow"? Should I create separate agents for each variation? That feels wrong - they're all booking, just with different pre-filled data. Or should the Router extract entities (doctor name, time, service) and pass them as parameters to a single flexible agent that skips steps dynamically? What's the best pattern in pipecat-flows for handling these variations without creating rigid, bounded flows for each request type?
2 likes • 28d
Unless I'm misunderstanding, I don't think you need a bridge server. You can do the μ-law conversion in the WebSocket transport serializer. You can specify a serializer when you create the transport, and if there isn't already one for Talkdesk, it should be easy to create one based on an existing serializer.

Here's the Telnyx serializer, for example:
https://docs.pipecat.ai/server/services/serializers/telnyx
https://github.com/pipecat-ai/pipecat/blob/08a9b434c1eafabd5416a8ec5861a8563cd9c709/src/pipecat/serializers/telnyx.py#L38

Custom serializer docs: https://docs.pipecat.ai/server/services/serializers/introduction

Does that make sense? Have you tried posting about Talkdesk in the Pipecat Discord? There may be somebody there who has already implemented a Talkdesk serializer.
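For what it's worth, here is roughly the shape such a serializer could take, folding the μ-law 8kHz ↔ PCM 16kHz conversion into the transport so the bridge server goes away. This is a hedged sketch, not a tested implementation: the `TalkdeskFrameSerializer` name and the JSON "event"/"media"/"payload" keys are placeholders (I don't have Talkdesk's wire format in front of me), the async serialize/deserialize interface mirrors the Telnyx serializer linked above as of recent Pipecat releases, and the stdlib `audioop` module stands in for Pipecat's own resampling helpers. Verify all of it against the release you have pinned.

```python
import audioop  # stdlib; deprecated in Python 3.13, fine on 3.12 and earlier
import base64
import json
from typing import Optional

from pipecat.frames.frames import Frame, InputAudioRawFrame, OutputAudioRawFrame
from pipecat.serializers.base_serializer import FrameSerializer, FrameSerializerType


class TalkdeskFrameSerializer(FrameSerializer):
    """Converts between mu-law/8 kHz websocket audio and Pipecat's 16 kHz PCM."""

    @property
    def type(self) -> FrameSerializerType:
        return FrameSerializerType.TEXT  # JSON text messages on the websocket

    async def serialize(self, frame: Frame) -> Optional[str]:
        # Bot audio going out: 16 kHz PCM -> 8 kHz -> mu-law -> base64 JSON.
        if isinstance(frame, OutputAudioRawFrame):
            pcm_8k, _ = audioop.ratecv(frame.audio, 2, 1, frame.sample_rate, 8000, None)
            ulaw = audioop.lin2ulaw(pcm_8k, 2)
            payload = base64.b64encode(ulaw).decode("ascii")
            # Placeholder message schema; use Talkdesk's real field names here.
            return json.dumps({"event": "media", "media": {"payload": payload}})
        return None

    async def deserialize(self, data: str) -> Optional[Frame]:
        # Caller audio coming in: base64 mu-law 8 kHz -> 16 kHz PCM frame.
        message = json.loads(data)
        if message.get("event") != "media":
            return None
        ulaw = base64.b64decode(message["media"]["payload"])
        pcm_8k = audioop.ulaw2lin(ulaw, 2)
        pcm_16k, _ = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, None)
        return InputAudioRawFrame(audio=pcm_16k, sample_rate=16000, num_channels=1)
```

You would then pass an instance of this to the websocket transport via its params (there is a serializer field for this; see the custom serializer docs linked above), so the conversion happens inside the same container as the pipeline and one network hop disappears.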
Kwindla Kramer
@kwindla-kramer-2446
I work on Pipecat and Daily infrastructure

Active 4d ago
Joined Nov 7, 2025