Hey everyone! 👋
Today I want to share something that took us months to figure out: **how to make voice AI agents that can actually make and receive real phone calls.**
Building an AI that works in a browser? Easy. Making it call phone numbers and handle incoming calls from customers? That's where things get complex.
## The Two Worlds Problem
Your voice AI lives in the modern world (WebRTC, low latency, great quality).
Your customers' phones live in the traditional world (SIP, PSTN, variable quality, higher latency).
**You need to bridge both worlds.**
## Our Architecture (Simplified)
```
Web/App Users              Phone Users
  (WebRTC)                 (SIP/PSTN)
      │                        │
      │                        │ via Twilio
      └───────────┬────────────┘
                  │
         ┌────────▼────────┐
         │  Media Gateway  │
         │ (Janus + Free   │
         │     SWITCH)     │
         └────────┬────────┘
                  │ Normalized audio
         ┌────────▼────────┐
         │  Voice AI Core  │
         │ (VAD/STT/LLM/   │
         │      TTS)       │
         └─────────────────┘
```
**The media gateway is the magic:**
- Converts WebRTC ↔ SIP protocols
- Transcodes audio formats (Opus ↔ G.711)
- Normalizes everything for AI processing
- Routes calls efficiently
## Tech Stack We Use
**For WebRTC:** Janus Gateway (lightweight, plugin-based)
**For SIP:** FreeSWITCH (handles 1000+ concurrent calls)
**SIP Provider:** Twilio (primary), Vonage (backup)
## Handling Inbound Calls (Customer → AI)
**Step 1:** Customer calls your number
**Step 2:** Twilio sends webhook to your server
**Step 3:** You return TwiML to stream audio via WebSocket
```python
from flask import Flask, Response
from twilio.twiml.voice_response import VoiceResponse, Connect

app = Flask(__name__)

@app.route("/voice/incoming", methods=["POST"])
def handle_incoming():
    response = VoiceResponse()
    # <Connect><Stream> opens a bidirectional WebSocket to your media server
    connect = Connect()
    connect.stream(url="wss://your-server.com/media-stream")
    response.append(connect)
    return Response(str(response), mimetype="text/xml")
```
**Step 4:** Audio streams bidirectionally (20ms chunks)
**Step 5:** Your VAD detects speech, processes, responds
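Those 20ms chunks from Step 4 arrive as JSON frames over the WebSocket you opened in Step 3. Here's a minimal sketch of parsing them, assuming Twilio's Media Streams message shape (`"media"` events carry base64-encoded 8kHz μ-law audio); the function name is illustrative, not a Twilio API:

```python
import base64
import json

def handle_media_message(raw: str):
    """Decode one Media Streams frame; return (event, audio_bytes)."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "media":
        # 20ms at 8kHz mu-law = 160 bytes of payload
        audio = base64.b64decode(msg["media"]["payload"])
        return event, audio
    # "start", "stop", and "mark" events carry no audio
    return event, b""

# Example frame shaped like Twilio's "media" event
frame = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\x7f" * 160).decode()},
})
event, audio = handle_media_message(frame)
```

From here, each decoded chunk goes into the VAD pipeline described below.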
## Handling Outbound Calls (AI → Customer)
```python
from twilio.rest import Client

client = Client(ACCOUNT_SID, AUTH_TOKEN)

call = client.calls.create(
    to="+15559876543",      # E.164 format, no dashes
    from_="+15551234567",
    # Twilio fetches TwiML from this URL when the customer answers
    url="https://your-server.com/voice/answered",
)
```
When customer answers, you get a webhook and start streaming audio.
Your AI leads the conversation (greeting, questions, responses).
## VAD/EOU with Phone Audio
This is tricky. Phone audio has:
- Variable quality (G.711 compression)
- Background noise (cars, crowds)
- Echo and feedback
- Latency variations
**Our solution:** Adaptive VAD
```python
from silero_vad import load_silero_vad

class RobustVAD:
    def __init__(self, call_type):
        self.model = load_silero_vad()
        # Adaptive thresholds: SIP audio is noisier, so be stricter
        if call_type == 'sip':
            self.vad_threshold = 0.6      # More conservative
            self.silence_threshold = 500  # ms of silence = end of utterance
        else:  # webrtc
            self.vad_threshold = 0.5
            self.silence_threshold = 300  # ms

    def adapt_threshold(self):
        # Placeholder: raise the threshold as measured noise energy rises
        return self.vad_threshold

    def process_chunk(self, audio_chunk):
        speech_prob = self.model(audio_chunk, 16000)
        # Adjust threshold based on current audio quality
        threshold = self.adapt_threshold()
        if speech_prob > threshold:
            return "SPEECH"
        else:
            return "SILENCE"
```
The VAD adapts to noise levels in real-time. Phone calls get more conservative thresholds.
## WebRTC for Web Users (The Fast Path)
For web/app users, bypass SIP entirely:
```javascript
// Client side: capture mic audio with browser-level cleanup enabled
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true
  }
});

const peerConnection = new RTCPeerConnection();
stream.getTracks().forEach(track => {
  peerConnection.addTrack(track, stream);
});
```
**Direct connection → 200-400ms faster than SIP**
## Real Production Numbers
**WebRTC calls (web/app):**
- Total latency: 700-1200ms
- Audio quality: Excellent
- Cost: ~$0.005/minute (compute only)
**SIP calls (phone):**
- Total latency: 950-1700ms
- Audio quality: Good (depends on carrier)
- Cost: ~$0.012/minute (Twilio + compute)
**Capacity per server:**
- 500 concurrent calls
- 2-5% CPU per call
- 50-100MB memory per call
**Reliability:**
- 94% call completion rate
- <2% audio quality issues
- 99.7% uptime
## Key Challenges We Solved
**1. Audio Format Hell**
- WebRTC uses Opus (great quality, low latency)
- Phone networks use G.711 (legacy, more latency)
- Solution: Media gateway transcodes everything to 16kHz PCM for AI
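As a rough sketch of that transcoding step, here's a pure-Python G.711 μ-law expansion followed by a naive 8kHz → 16kHz upsample. The function names are illustrative, and the sample-duplication resample is a placeholder; a real gateway would use a proper polyphase resampler:

```python
def ulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a 16-bit linear PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def transcode_ulaw_8k_to_pcm_16k(payload: bytes) -> list[int]:
    """Decode mu-law and duplicate each sample to reach 16 kHz."""
    out = []
    for b in payload:
        s = ulaw_to_linear(b)
        out.extend((s, s))  # naive 8 kHz -> 16 kHz
    return out
```

One 20ms phone frame (160 μ-law bytes) becomes 320 PCM samples, ready for the AI pipeline.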
**2. Latency Variations**
- WebRTC: 20-50ms network latency
- SIP: 100-300ms network latency
- Solution: Adaptive timeouts based on call type
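Those adaptive timeouts can be as simple as a per-call-type profile, mirroring the thresholds in the `RobustVAD` example above. The exact numbers and names here are illustrative:

```python
TIMEOUTS_MS = {
    # call_type: end-of-utterance silence, max wait for first audio
    "webrtc": {"eou_silence": 300, "first_audio": 1500},
    "sip":    {"eou_silence": 500, "first_audio": 2500},
}

def timeouts_for(call_type: str) -> dict:
    """Fall back to the conservative SIP profile for unknown call types."""
    return TIMEOUTS_MS.get(call_type, TIMEOUTS_MS["sip"])
```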
**3. Echo and Feedback**
- Phone calls can create echo loops
- Solution: Server-side echo cancellation + conservative VAD
**4. Load Balancing**
- Need to route calls to nearest server
- Solution: Geographic routing based on caller location
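A minimal version of that routing picks the region with the smallest great-circle distance to the caller. The region list and coordinates below are placeholders, not our actual deployment:

```python
import math

REGIONS = {
    "us-east": (38.9, -77.0),
    "eu-west": (53.3, -6.3),
    "ap-south": (19.1, 72.9),
}

def nearest_region(lat: float, lon: float) -> str:
    """Return the region key closest to the caller's coordinates."""
    def haversine_km(a, b):
        lat1, lon1 = map(math.radians, a)
        lat2, lon2 = map(math.radians, b)
        dlat, dlon = lat2 - lat1, lon2 - lon1
        h = (math.sin(dlat / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(h))

    return min(REGIONS, key=lambda r: haversine_km((lat, lon), REGIONS[r]))
```

In practice the caller's location comes from the SIP provider's metadata (or IP geolocation for WebRTC users).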
## The Benefits of This Architecture
✅ **Unified processing:** Same AI engine handles web and phone calls
✅ **Geographic routing:** Low latency by routing to nearest server
✅ **Scalability:** Add servers to handle more calls
✅ **Flexibility:** Support both inbound and outbound calls
✅ **Quality:** WebRTC users get best quality, phone users get reliable quality
## Common Mistakes to Avoid
❌ **Using SIP for web users** - Unnecessary latency
❌ **Not adapting VAD for call type** - Phone calls need different thresholds
❌ **Ignoring audio format conversion** - Causes quality degradation
❌ **Single-region deployment** - Higher latency for distant users
❌ **Not monitoring call quality** - You need real-time metrics
## What's Coming Next
Next article: **Deep-dive architecture breakdown**
I'll cover:
- Horizontal scaling strategies
- Database design for call state
- Real-time monitoring setup
- Disaster recovery
- Security and compliance
- Cost optimization
**The complete blueprint for production voice AI.**
## Questions for You 🤔
1. Are you using SIP, WebRTC, or both?
2. What SIP provider are you using? (Twilio, Vonage, Telnyx?)
3. What's your biggest challenge with phone integration?
4. What latency are you seeing for phone calls?
5. How are you handling VAD on noisy phone audio?
Drop your answers below! And if you're struggling with any of this, let me knowโhappy to help troubleshoot.
---
**Full technical deep-dive on Medium:**
Building voice AI that connects to the real world is hard, but it's solvable. This architecture handles thousands of calls daily in production.
Let's keep sharing and learning together!