I’m building live transcription for a mobile AI coach and aiming for sub-200 ms latency. If you’ve shipped this, what worked best for you?
- Capture → Stream: AVAudioEngine (iOS) / Oboe (Android) + VAD? (rough sketch of my current iOS capture path below)
- Transport: WebRTC vs WebSocket; Opus vs raw PCM; ideal chunk size for partials? (bare-bones WebSocket sketch at the end)
- Latency control: jitter buffers, endpointing, punctuation without lag.
- Accuracy extras: word-level timestamps, diarization, noise suppression/AGC/AEC.
- Resilience: packet loss, reconnect, FEC; buffering on shaky networks.
- Privacy & cost: on-device vs cloud redaction; pricing gotchas.
Short code snippets, architectures, or repo links would be amazing—thanks!
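
In case it helps frame the transport question, this is the bare-bones WebSocket client I'm testing before deciding whether WebRTC is worth the extra machinery. The `wss://` endpoint and the `{"partial": "..."}` message shape are placeholders for illustration, not any particular provider's protocol.

```swift
import Foundation

// Sketch: one ~100 ms PCM (or Opus) chunk per binary frame out, JSON partials in.
final class TranscriptSocket {
    private var task: URLSessionWebSocketTask?

    func connect(url: URL, onPartial: @escaping (String) -> Void) {
        task = URLSession.shared.webSocketTask(with: url)
        task?.resume()
        listen(onPartial: onPartial)
    }

    // Send one audio chunk as a binary frame.
    func send(chunk: Data) {
        task?.send(.data(chunk)) { error in
            if let error = error {
                print("send failed: \(error)")   // real code: buffer locally + reconnect with backoff
            }
        }
    }

    private func listen(onPartial: @escaping (String) -> Void) {
        task?.receive { [weak self] result in
            switch result {
            case .success(.string(let text)):
                // Hypothetical partial-result message: {"partial": "hello wor"}
                if let data = text.data(using: .utf8),
                   let obj = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any],
                   let partial = obj["partial"] as? String {
                    onPartial(partial)
                }
            case .success:
                break   // ignore binary/other frames in this sketch
            case .failure(let error):
                print("socket error: \(error)")  // real code: reconnect and replay buffered audio
                return
            }
            self?.listen(onPartial: onPartial)   // keep the receive loop going
        }
    }
}
```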