We’re building a call-style UI where:
- The user can talk freely (like a real phone call)
- Silence detection determines when a “turn” ends
- Short pauses are merged into one thought
- The AI can be interrupted if the user starts talking
- Audio playback and mic capture work reliably on iOS Safari
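For concreteness, the "silence detection" we're attempting is roughly frame-level RMS energy against a threshold (a simplified sketch; the threshold constant is illustrative and would need per-device calibration, and in practice the frames would come from an `AnalyserNode` or `AudioWorklet`):

```typescript
// Classify one audio frame as speech or silence by RMS energy.
// SILENCE_RMS_THRESHOLD is illustrative — real code should calibrate
// it per microphone/environment (or use an adaptive noise floor).
const SILENCE_RMS_THRESHOLD = 0.01;

function frameRms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

function isSpeech(samples: Float32Array): boolean {
  return frameRms(samples) > SILENCE_RMS_THRESHOLD;
}
```

A fixed threshold like this is exactly where things get flaky for us (fan noise, iOS AGC), which is part of why we're asking.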
Right now we’re running into a few issues:
- Silence detection doesn’t reliably stop listening
- Turns fire too early or too late
- Transcription sometimes fails or never triggers
- iOS Safari adds extra constraints around audio unlock and playback
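To show what we mean by "turns fire too early or too late": our turn-end logic is essentially a silence "hangover" timer — a turn only ends after N consecutive ms of silence, so short pauses merge into one utterance. A minimal sketch of that state machine (constants illustrative; `speech` would come from per-frame VAD decisions):

```typescript
// Turn-end detection with a silence hangover: the turn ends only after
// HANGOVER_MS of continuous silence, so brief mid-sentence pauses are
// merged into the same turn. HANGOVER_MS is an assumption — tune it.
const HANGOVER_MS = 800;

type TurnState = { speaking: boolean; silenceSince: number | null };

function initTurn(): TurnState {
  return { speaking: false, silenceSince: null };
}

// Feed one VAD decision per frame with a timestamp (ms).
// Returns true exactly once, at the moment the turn ends.
function onFrame(state: TurnState, speech: boolean, now: number): boolean {
  if (speech) {
    state.speaking = true;
    state.silenceSince = null; // pause cancelled, keep the turn open
    return false;
  }
  if (!state.speaking) return false; // silence before any speech at all
  if (state.silenceSince === null) {
    state.silenceSince = now; // start the hangover clock
    return false;
  }
  if (now - state.silenceSince >= HANGOVER_MS) {
    state.speaking = false;
    state.silenceSince = null;
    return true; // turn ended
  }
  return false;
}
```

Tuning `HANGOVER_MS` is the early/late trade-off: too short and we cut people off mid-sentence, too long and the AI feels unresponsive.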
If you’ve solved this (or seen a solid pattern for frontend VAD + turn management in the browser), I’d love to hear:
- What approach worked for you
- Any gotchas with MediaRecorder / Web Audio API
- Whether you kept VAD / turn logic on the frontend or moved it to the backend
Appreciate any war stories or architecture advice 🙏