Vobiz AI Voice Agent via WebSockets
This integration guide is designed for developers and engineers who wish to build custom AI voice agents directly with Vobiz. It covers everything from high-level architecture to bit-level audio transformations using WebSockets and the Python reference implementation.
Resources:
- GitHub Repository: Vobiz-Python-Voice-API-Example Reference Implementation
- Primary Language: Python 3.11+
- Core Frameworks: FastAPI (HTTP), websockets (WSS), pyngrok (Tunneling)
Architecture & Connectivity
The Split-Server Model
The reference implementation runs two servers concurrently:
- FastAPI Server (Port 5000): Handles HTTP Webhooks (`/answer`, `/hangup`) and acts as a WebSocket proxy/gateway.
- Agent Server (Port 5001): A dedicated `websockets` server that maintains call state and processes the audio stream.
Flow Overview
- The Vobiz Cloud receives a call to your Phone Number.
- An HTTP POST webhook is fired to your `/answer` route (e.g., via an ngrok tunnel).
- Your server returns an XML `<Stream>` response to Vobiz.
- Vobiz initiates a WSS Upgrade Request to your specified WebSocket URL.
- Your FastAPI server proxies the WebSocket connection to your local Agent Server to establish the session.
- A bidirectional audio stream then flows between Vobiz and your Agent Server through that proxy.
Vobiz XML Protocol
Vobiz uses a specialized XML structure to orchestrate calls. These responses must be returned by your /answer endpoint.
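As a minimal sketch, an `/answer` handler could assemble this XML with Python's standard library instead of string concatenation. The function name `build_stream_response` and its parameters are hypothetical and not taken from the reference implementation; the element and attribute names mirror the `<Stream>` example in the next section.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_stream_response(ws_url: str, status_callback_url: Optional[str] = None) -> str:
    """Return a <Response><Stream> document as an XML string."""
    response = ET.Element("Response")
    attrs = {
        "bidirectional": "true",
        "keepCallAlive": "true",
        "contentType": "audio/x-mulaw;rate=8000",
    }
    # The status callback is only added when a URL is supplied.
    if status_callback_url:
        attrs["statusCallbackUrl"] = status_callback_url
        attrs["statusCallbackMethod"] = "POST"
    stream = ET.SubElement(response, "Stream", attrs)
    stream.text = ws_url  # the WebSocket URL is the element's text content
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_stream_response("wss://your-domain.com/ws"))
```

Building the element tree lets the library handle escaping if a URL ever contains characters such as `&`.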
The Binary Audio Stream (Primary)
To initiate a bidirectional voice session, return the <Stream> tag.
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <!--
      - bidirectional="true": Enables sound to flow both ways.
      - keepCallAlive="true": Prevents hangup when stream starts.
      - contentType: Defines audio format (G.711 mu-law is standard).
    -->
    <Stream
        bidirectional="true"
        keepCallAlive="true"
        contentType="audio/x-mulaw;rate=8000"
        statusCallbackUrl="https://your-domain.com/stream-status"
        statusCallbackMethod="POST">
        wss://your-domain.com/ws
    </Stream>
</Response>
Handling Hangups
You can configure a global Hangup URL in your Vobiz Application settings or specify it per-call in the REST API.
- Log into the Vobiz Console.
- Navigate to Applications.
- Edit your voice application.
- Set the Hangup URL to `https://<your-ngrok-url>/hangup`.
- Set Hangup Method to `POST`.
WebSocket Event Documentation
Once the WebSocket handshake is successful, Vobiz and the Agent exchange JSON frames.
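A minimal sketch of how an agent might dispatch the inbound frames documented in this section. The function `handle_frame` and the `state` dictionary are illustrative names, not part of the reference implementation; the field names come from the event examples below.

```python
import base64
import json


def handle_frame(raw: str, state: dict) -> None:
    """Update per-call state from one inbound JSON frame."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "start":
        # Remember the identifiers needed for later outbound events.
        state["stream_id"] = msg["streamId"]
        state["call_id"] = msg["callId"]
    elif event == "media":
        # The payload is base64-encoded 8-bit mu-law audio.
        chunk = base64.b64decode(msg["media"]["payload"])
        state.setdefault("audio", bytearray()).extend(chunk)
    elif event == "playedStream":
        # A named checkpoint has been reached during playback.
        state.setdefault("checkpoints", []).append(msg["name"])
    elif event == "stop":
        state["stopped"] = True
    # Unknown events are ignored in this sketch.
```

In a real agent the `media` branch would forward the decoded bytes to a speech-to-text service rather than buffering them.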
Inbound Events (From Vobiz to Agent)
start
Sent once at the beginning of the stream.
{
  "event": "start",
  "streamId": "s-uuid-123",
  "callId": "c-uuid-456",
  "mediaServer": "vobiz-node-05",
  "metadata": {
    "from": "+19295551212",
    "to": "+18005550199"
  }
}
media
Sent every 20ms while the caller is speaking.
{
  "event": "media",
  "media": {
    "payload": "base64_encoded_8_bit_mulaw_samples",
    "track": "inbound",
    "chunkId": "105"
  }
}
playedStream
Sent after the agent's audio reaches a checkpoint.
{
  "event": "playedStream",
  "streamId": "s-uuid-123",
  "name": "greeting_complete"
}
stop
Sent when the call ends or the stream is closed by Vobiz.
{
  "event": "stop",
  "streamId": "s-uuid-123"
}
Outbound Events (From Agent to Vobiz)
playAudio
Commands Vobiz to play sound to the caller.
{
  "event": "playAudio",
  "media": {
    "contentType": "audio/x-mulaw",
    "sampleRate": 8000,
    "payload": "base64_mulaw_bytes"
  }
}
clearAudio
Immediately stops all pending audio in the Vobiz buffer. Crucial for Barge-in (interruption).
{
  "event": "clearAudio",
  "streamId": "s-uuid-123"
}
checkpoint
Inserts a marker in the stream. Can be used to track progress of TTS delivery.
{
  "event": "checkpoint",
  "streamId": "s-uuid-123",
  "name": "step_2_instructions"
}
Audio Engineering Deep-Dive
Telephony uses the G.711 standard. Modern AI text-to-speech engines output high-fidelity PCM, which must be downsampled and companded for the phone network.
G.711 Mu-law (PCMU)
- Sample Rate: 8000 Hz (8kHz)
- Bit Depth: 8-bit
- Compression: Logarithmic (companding)
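To make the logarithmic companding concrete, here is a sketch of the standard G.711 mu-law expansion from an 8-bit byte to a 16-bit linear sample, which is the direction needed for inbound `media` payloads. The function name is illustrative and the code is not copied from agent.py.

```python
def mulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    byte = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = byte & 0x80               # top bit: sign
    exponent = (byte >> 4) & 0x07    # next 3 bits: segment (power of two)
    mantissa = byte & 0x0F           # low 4 bits: position within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Adjacent codes are 8 units apart near silence but 1024 apart near full scale.
quiet_step = mulaw_to_linear(0xFE) - mulaw_to_linear(0xFF)
loud_step = mulaw_to_linear(0x80) - mulaw_to_linear(0x81)
print(quiet_step, loud_step)  # prints: 8 1024
```

The two step sizes show why 8 bits are enough for speech: quiet passages keep fine resolution, while loud passages tolerate coarser steps.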
The Conversion Pipeline in agent.py
- Synthesis: AI TTS tools often return 16-bit PCM at 24,000 Hz or similar.
- Downsampling: We use linear interpolation to drop from 24kHz to 8kHz.
- Companding: Each 16-bit linear sample is converted to an 8-bit mu-law byte using bitmasks and shift operations. The logarithmic scale gives quiet samples finer resolution than loud ones.
- Packetization: Audio is sent in 160-byte chunks, representing exactly 20 milliseconds of speech.
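The four steps above can be sketched end to end with the standard library only. This is an illustration under stated assumptions, not the code in agent.py: all function names are hypothetical, the input is assumed to be a list of signed 16-bit integers at 24 kHz, and the final short chunk is padded with `0xFF` (mu-law silence), which is a choice made for this sketch.

```python
import base64
import json

BIAS = 0x84    # standard mu-law bias
CLIP = 32635   # largest magnitude before the bias is added


def downsample_linear(samples, src_rate=24000, dst_rate=8000):
    """Resample by linear interpolation (no anti-aliasing filter)."""
    step = src_rate / dst_rate
    out = []
    for i in range(int(len(samples) / step)):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(int(round(samples[lo] * (1 - frac) + samples[hi] * frac)))
    return out


def linear_to_mulaw(sample: int) -> int:
    """Compress one signed 16-bit sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    # Find the segment: the position of the highest set bit.
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def packetize(mulaw: bytes, size: int = 160):
    """Yield fixed-size chunks; 160 bytes is 20 ms at 8 kHz, 8-bit."""
    for start in range(0, len(mulaw), size):
        yield mulaw[start:start + size].ljust(size, b"\xff")


def play_audio_frame(chunk: bytes) -> str:
    """Wrap one chunk in a playAudio event."""
    return json.dumps({
        "event": "playAudio",
        "media": {
            "contentType": "audio/x-mulaw",
            "sampleRate": 8000,
            "payload": base64.b64encode(chunk).decode("ascii"),
        },
    })


def pcm_to_frames(pcm, src_rate=24000):
    """Run the full pipeline: resample, compand, packetize, wrap."""
    narrow = downsample_linear(pcm, src_rate, 8000)
    mulaw = bytes(linear_to_mulaw(s) for s in narrow)
    return [play_audio_frame(c) for c in packetize(mulaw)]
```

For example, 30 ms of 24 kHz input (720 samples) becomes 240 mu-law bytes and therefore two frames, the second padded with silence. Plain linear interpolation keeps the sketch short; a production resampler would normally low-pass filter first to limit aliasing.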
Advanced Usage & Outbound Calls
Using make_call.py
The outbound script builds the request payload for the Vobiz Account/{id}/Call/ endpoint and sends it to place an outbound call.
- Run `python server.py` in one terminal window.
- Open a new terminal.
- Run `python make_call.py --to +DestNumber`.
REST API Payload Example
When make_call.py runs, it sends the following JSON payload to Vobiz:
{
  "from": "+YourVobizNumber",
  "to": "+CustomerNumber",
  "answer_url": "https://ngrok.url/answer",
  "answer_method": "POST",
  "hangup_url": "https://ngrok.url/hangup",
  "hangup_method": "POST"
}
Number Configuration
Ensure your Vobiz number is associated with an Application in the portal that points to your public URLs. If using make_call.py, the answer_url provided in the request will override the portal defaults for that specific call.
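As a rough sketch, the payload shown under REST API Payload Example could be assembled from a single public base URL so that the per-call `answer_url` and `hangup_url` always point at the same tunnel. The helper `build_call_payload` is a hypothetical name, not a function from make_call.py, and the sketch deliberately stops before the HTTP request because the API host and authentication are not covered here.

```python
import json


def build_call_payload(from_number: str, to_number: str, public_base_url: str) -> dict:
    """Build the JSON body for an outbound call request."""
    base = public_base_url.rstrip("/")  # avoid double slashes in the URLs
    return {
        "from": from_number,
        "to": to_number,
        "answer_url": f"{base}/answer",
        "answer_method": "POST",
        "hangup_url": f"{base}/hangup",
        "hangup_method": "POST",
    }


if __name__ == "__main__":
    payload = build_call_payload("+YourVobizNumber", "+CustomerNumber", "https://ngrok.url/")
    print(json.dumps(payload, indent=2))
```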
Troubleshooting & FAQ
The AI is talking over me / not stopping?
Check the utterance_end_ms value in the Deepgram configuration in agent.py. This setting controls how long a gap in speech Deepgram waits for before signaling the end of an utterance, so a high value makes the agent slower to react to your turn. Also, ensure clearAudio is sent as soon as caller speech is detected.
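One way to keep barge-in responsive is to track whether the agent is currently speaking and emit `clearAudio` the moment caller speech is detected. The sketch below assumes a hypothetical `Playback` record and `on_caller_speech` hook; how speech detection is wired up in agent.py is not shown here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Playback:
    stream_id: str
    agent_speaking: bool = False


def on_caller_speech(playback: Playback) -> Optional[str]:
    """Return a clearAudio frame if the agent is mid-utterance, else None."""
    if not playback.agent_speaking:
        return None
    # Mark playback as interrupted so the sender can stop queuing TTS chunks.
    playback.agent_speaking = False
    return json.dumps({"event": "clearAudio", "streamId": playback.stream_id})
```

The returned string would be sent over the WebSocket immediately, before any new transcription or LLM work is started.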
I see "401 Unauthorized" in the logs?
Ensure your NGROK_AUTH_TOKEN is set in your environment. pyngrok requires authentication for persistent tunnels and certain advanced features.
Why 20ms chunks?
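20 ms is the conventional packetization interval for G.711 telephony: at 8000 samples per second and one byte per sample, each frame is 160 bytes. Sending larger chunks adds latency and can cause "jitter" or robotic-sounding audio, while smaller chunks create excessive network overhead for the Vobiz ingress nodes.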
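The trade-off can be checked with a few lines of arithmetic. This sketch assumes 8 kHz, 8-bit mu-law audio as described earlier; `frame_stats` is an illustrative helper name.

```python
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1  # 8-bit mu-law


def frame_stats(frame_ms: int) -> dict:
    """Return payload size and message rate for a given frame duration."""
    raw = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    return {
        "raw_bytes": raw,
        "base64_chars": 4 * ((raw + 2) // 3),  # padded base64 length
        "frames_per_second": 1000 // frame_ms,
    }


for ms in (10, 20, 40):
    print(ms, frame_stats(ms))
# 10 {'raw_bytes': 80, 'base64_chars': 108, 'frames_per_second': 100}
# 20 {'raw_bytes': 160, 'base64_chars': 216, 'frames_per_second': 50}
# 40 {'raw_bytes': 320, 'base64_chars': 428, 'frames_per_second': 25}
```

Halving the frame duration doubles the number of WebSocket messages per second, and each message carries the same fixed JSON envelope, so the relative overhead grows.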