NVIDIA released a new open-source speech-to-text (STT) model designed from the ground up for low-latency use cases like voice agents. This is part of NVIDIA's new focus on open models, which I'm excited about. The new models in the Nemotron family include STT and TTS models, LLMs, and specialized models such as guardrail models. And they are completely open: open weights, training code, training datasets, and inference tooling.
This new STT model is very fast. Here's a voice agent running locally on my RTX 5090 with sub-500ms voice-to-voice inference.
Also, Twitter and LinkedIn, if either of those platforms is your thing. (I post a lot about voice agents on both.)