Vobiz AI Voice Agent via WebSockets

This integration guide is designed for developers and engineers who wish to build custom AI voice agents directly with Vobiz. It covers everything from high-level architecture to bit-level audio transformations using WebSockets and the Python reference implementation.

Resources:

GitHub Repository: Vobiz-Python-Voice-API-Example Reference Implementation
Primary Language: Python 3.11+
Core Frameworks: FastAPI (HTTP), websockets (WSS), pyngrok (Tunneling)

Architecture & Connectivity

The Split-Server Model

The reference implementation runs two servers concurrently:

FastAPI Server (Port 5000): Handles HTTP Webhooks (`/answer`, `/hangup`) and acts as a WebSocket proxy/gateway.
Agent Server (Port 5001): A dedicated websockets server that maintains the call state and audio stream processing.

Flow Overview

The Vobiz Cloud receives a call to your Phone Number.
An HTTP POST webhook is fired to your /answer route (e.g., via an Ngrok tunnel).
Your server returns an XML <Stream> response back to Vobiz.
Vobiz initiates a WSS Upgrade Request to your specified WebSocket URL.
Your FastAPI server proxies the WebSocket connection to the local Agent Server, which establishes the session.
A bidirectional audio stream begins directly between Vobiz and your Agent Server.

Vobiz XML Protocol

Vobiz uses a specialized XML structure to orchestrate calls. These responses must be returned by your /answer endpoint.

The Binary Audio Stream (Primary)

To initiate a bidirectional voice session, return the <Stream> tag.

XML

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <!-- 
      - bidirectional="true": Enables sound to flow both ways.
      - keepCallAlive="true": Prevents hangup when stream starts.
      - contentType: Defines audio format (G.711 mu-law is standard).
    -->
    <Stream 
        bidirectional="true" 
        keepCallAlive="true" 
        contentType="audio/x-mulaw;rate=8000" 
        statusCallbackUrl="https://your-domain.com/stream-status" 
        statusCallbackMethod="POST">
        wss://your-domain.com/ws
    </Stream>
</Response>
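As a minimal sketch, the response above could be generated in Python with the standard library rather than a hand-written string. The function name `build_stream_response` is hypothetical and not taken from the reference implementation; the attribute names and values are copied from the XML shown above.

```python
import xml.etree.ElementTree as ET


def build_stream_response(ws_url: str, status_callback_url: str) -> str:
    """Return the <Response><Stream> document as a string (hypothetical helper)."""
    response = ET.Element("Response")
    stream = ET.SubElement(
        response,
        "Stream",
        {
            "bidirectional": "true",
            "keepCallAlive": "true",
            "contentType": "audio/x-mulaw;rate=8000",
            "statusCallbackUrl": status_callback_url,
            "statusCallbackMethod": "POST",
        },
    )
    # The WebSocket URL is the text content of the <Stream> element.
    stream.text = ws_url
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


xml_body = build_stream_response(
    "wss://your-domain.com/ws", "https://your-domain.com/stream-status"
)
```

Building the document with `ElementTree` means attribute values and the URL are escaped automatically, which matters if a callback URL contains `&`.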

Handling Hangups

You can configure a global Hangup URL in your Vobiz Application settings or specify it per-call in the REST API.

Log into the Vobiz Console.
Navigate to Applications.
Edit your voice application.
Set the Hangup URL to https://<your-ngrok-url>/hangup.
Set Hangup Method to POST.

WebSocket Event Documentation

Once the WebSocket handshake is successful, Vobiz and the Agent exchange JSON frames.

Inbound Events (From Vobiz to Agent)

`start`

Sent once at the beginning of the stream.

JSON

{
  "event": "start",
  "streamId": "s-uuid-123",
  "callId": "c-uuid-456",
  "mediaServer": "vobiz-node-05",
  "metadata": {
    "from": "+19295551212",
    "to": "+18005550199"
  }
}

`media`

Sent every 20ms while the caller is speaking.

JSON

{
  "event": "media",
  "media": {
    "payload": "base64_encoded_8_bit_mulaw_samples",
    "track": "inbound",
    "chunkId": "105"
  }
}

`playedStream`

Sent after the agent's audio reaches a checkpoint.

JSON

{
  "event": "playedStream",
  "streamId": "s-uuid-123",
  "name": "greeting_complete"
}

`stop`

Sent when the call ends or the stream is closed by Vobiz.

JSON

{
  "event": "stop",
  "streamId": "s-uuid-123"
}
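The four inbound events can be handled by a single dispatcher. The sketch below assumes each frame arrives as a JSON text message with the fields shown above; `handle_inbound` and the keys of the `state` dictionary are hypothetical names, not part of the Vobiz protocol or of `agent.py`.

```python
import base64
import json


def handle_inbound(raw: str, state: dict) -> None:
    """Update `state` from one JSON frame sent by Vobiz (hypothetical helper)."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "start":
        state["stream_id"] = msg["streamId"]
        state["call_id"] = msg["callId"]
    elif event == "media":
        # The payload is base64 text; decoding yields one mu-law byte per sample.
        chunk = base64.b64decode(msg["media"]["payload"])
        state.setdefault("audio", bytearray()).extend(chunk)
    elif event == "playedStream":
        state.setdefault("played", []).append(msg["name"])
    elif event == "stop":
        state["stopped"] = True
    # Any other event is ignored in this sketch.


state = {}
media_frame = json.dumps({
    "event": "media",
    "media": {
        "payload": base64.b64encode(b"\xff" * 160).decode("ascii"),
        "track": "inbound",
        "chunkId": "105",
    },
})
handle_inbound('{"event": "start", "streamId": "s-uuid-123", "callId": "c-uuid-456"}', state)
handle_inbound(media_frame, state)
handle_inbound('{"event": "stop", "streamId": "s-uuid-123"}', state)
```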

Outbound Events (From Agent to Vobiz)

`playAudio`

Commands Vobiz to play sound to the caller.

JSON

{
  "event": "playAudio",
  "media": {
    "contentType": "audio/x-mulaw",
    "sampleRate": 8000,
    "payload": "base64_mulaw_bytes"
  }
}
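A `playAudio` frame can be assembled from raw mu-law bytes as sketched below. The helper name `play_audio_frame` is hypothetical; the JSON field names and values are those shown above.

```python
import base64
import json


def play_audio_frame(mulaw_bytes: bytes) -> str:
    """Wrap raw mu-law bytes in a playAudio JSON frame (hypothetical helper)."""
    return json.dumps({
        "event": "playAudio",
        "media": {
            "contentType": "audio/x-mulaw",
            "sampleRate": 8000,
            # JSON cannot carry raw bytes, so the audio is base64-encoded text.
            "payload": base64.b64encode(mulaw_bytes).decode("ascii"),
        },
    })


# 160 bytes of 0xFF is 20 ms of mu-law silence at 8000 Hz.
frame = play_audio_frame(b"\xff" * 160)
```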

`clearAudio`

Immediately stops all pending audio in the Vobiz buffer. Crucial for Barge-in (interruption).

JSON

{
  "event": "clearAudio",
  "streamId": "s-uuid-123"
}
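One way to structure barge-in is to drop any audio still queued on the agent side and then send `clearAudio` so Vobiz flushes its own buffer. This is a sketch under assumptions: the `Playback` class and its `pending` queue are hypothetical and may not match how `agent.py` buffers audio.

```python
import json
from collections import deque


class Playback:
    """Tracks agent audio queued locally but not yet sent (hypothetical)."""

    def __init__(self, stream_id: str) -> None:
        self.stream_id = stream_id
        self.pending = deque()

    def barge_in(self) -> str:
        # Drop unsent audio first so nothing new reaches Vobiz after the flush.
        self.pending.clear()
        return json.dumps({"event": "clearAudio", "streamId": self.stream_id})


playback = Playback("s-uuid-123")
playback.pending.extend(["frame-1", "frame-2"])
clear_frame = playback.barge_in()
```

The returned string would be sent over the WebSocket as soon as caller speech is detected.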

`checkpoint`

Inserts a marker in the stream. Can be used to track progress of TTS delivery.

JSON

{
  "event": "checkpoint",
  "streamId": "s-uuid-123",
  "name": "step_2_instructions"
}
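Checkpoints can be paired with the inbound `playedStream` event to learn when a piece of audio has actually been played. The sketch below assumes that the `name` in `playedStream` echoes the `name` sent in `checkpoint`, which the examples above suggest but do not state outright; `CheckpointTracker` is a hypothetical helper.

```python
import json


class CheckpointTracker:
    """Remembers checkpoint names until Vobiz reports them as played."""

    def __init__(self, stream_id: str) -> None:
        self.stream_id = stream_id
        self.waiting = set()

    def checkpoint_frame(self, name: str) -> str:
        self.waiting.add(name)
        return json.dumps(
            {"event": "checkpoint", "streamId": self.stream_id, "name": name}
        )

    def on_played_stream(self, msg: dict) -> bool:
        """Return True if `msg` confirms a checkpoint we were waiting for."""
        name = msg.get("name")
        if name in self.waiting:
            self.waiting.discard(name)
            return True
        return False


tracker = CheckpointTracker("s-uuid-123")
sent = tracker.checkpoint_frame("greeting_complete")
confirmed = tracker.on_played_stream(
    {"event": "playedStream", "streamId": "s-uuid-123", "name": "greeting_complete"}
)
```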

Audio Engineering Deep-Dive

The phone network carries audio as G.711. AI text-to-speech engines typically output higher-fidelity PCM, which must be downsampled and companded before it is sent to the caller.

G.711 Mu-law (PCMU)

Sample Rate: 8000 Hz (8kHz)
Bit Depth: 8-bit
Compression: Logarithmic (companding)
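The numbers above determine the bitrate and the size of a 20 ms frame. The arithmetic is worked out here in Python; the constant names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
FRAME_MS = 20

# 8000 samples/s * 8 bits/sample
bitrate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
# 8000 samples/s * 0.020 s
samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
# One byte per sample at 8-bit depth
bytes_per_frame = samples_per_frame * BITS_PER_SAMPLE // 8
frames_per_second = 1000 // FRAME_MS

print(bitrate_bps, samples_per_frame, bytes_per_frame, frames_per_second)
# prints: 64000 160 160 50
```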

The Conversion Pipeline in `agent.py`

Synthesis: AI TTS tools often return 16-bit PCM at 24,000 Hz or similar.
Downsampling: We use linear interpolation to drop from 24kHz to 8kHz.
Companding: Each 16-bit linear sample is converted to an 8-bit mu-law byte using bitmasks and shift operations. The logarithmic curve gives quiet (low-amplitude) samples finer resolution than loud ones.
Packetization: Audio is sent in 160-byte chunks, representing exactly 20 milliseconds of speech.
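The downsampling, companding, and packetization steps can be sketched in pure Python as follows. This is an illustration under assumptions, not the code in `agent.py`: the function names are hypothetical, the input is assumed to be little-endian 16-bit mono PCM, the encoder is the commonly used G.711 mu-law bit-manipulation routine, and padding the final frame with mu-law silence (`0xFF`) is a design choice made here.

```python
import struct

BIAS = 0x84   # standard mu-law bias (132)
CLIP = 32635  # largest magnitude that stays in 15 bits after adding BIAS


def linear_to_mulaw(sample: int) -> int:
    """Compand one signed 16-bit sample into one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Exponent = index of the highest set bit, counted from bit 7 (0) to bit 14 (7).
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # Mu-law bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def resample_linear(samples, src_rate: int, dst_rate: int) -> list[int]:
    """Resample by linear interpolation between neighbouring samples."""
    n_out = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out


def packetize(mulaw: bytes, frame_bytes: int = 160) -> list[bytes]:
    """Split into fixed-size frames, padding the last one with silence."""
    return [
        mulaw[i:i + frame_bytes].ljust(frame_bytes, b"\xff")
        for i in range(0, len(mulaw), frame_bytes)
    ]


def pcm16_to_mulaw_frames(pcm: bytes, src_rate: int = 24000) -> list[bytes]:
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[:count * 2])
    narrow = resample_linear(samples, src_rate, 8000)
    return packetize(bytes(linear_to_mulaw(s) for s in narrow))


# 50 ms of silence at 24 kHz: 1200 samples in, 400 out, 3 frames (last one padded).
frames = pcm16_to_mulaw_frames(b"\x00\x00" * 1200)
```

Because 24,000 is an exact multiple of 8,000, every output position lands on an input sample, so interpolation here reduces to keeping every third sample. Without a low-pass filter in front of it, content above 4 kHz can alias into the output.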

Advanced Usage & Outbound Calls

Using `make_call.py`

The outbound script builds the request payload and sends it to the Vobiz Account/{id}/Call/ endpoint to place an outbound call.

Run python server.py in one terminal window.
Open a new terminal.
Run python make_call.py --to +DestNumber.

REST API Payload Example

When make_call.py runs, it sends the following JSON payload to Vobiz:

JSON

{
  "from": "+YourVobizNumber",
  "to": "+CustomerNumber",
  "answer_url": "https://ngrok.url/answer",
  "answer_method": "POST",
  "hangup_url": "https://ngrok.url/hangup",
  "hangup_method": "POST"
}
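The payload above could be assembled as in the sketch below, which builds the request but does not send it. The host in `API_BASE`, the `ACCOUNT_ID` value, and the function name are hypothetical placeholders: this guide does not state the API host or the authentication scheme, so consult `make_call.py` for the real values.

```python
import json
import urllib.request

# Hypothetical placeholders; the real host and account ID come from your Vobiz setup.
API_BASE = "https://api.example.invalid"
ACCOUNT_ID = "YOUR_ACCOUNT_ID"


def build_call_request(
    from_number: str, to_number: str, public_url: str
) -> urllib.request.Request:
    """Build (but do not send) the outbound-call request."""
    payload = {
        "from": from_number,
        "to": to_number,
        "answer_url": f"{public_url}/answer",
        "answer_method": "POST",
        "hangup_url": f"{public_url}/hangup",
        "hangup_method": "POST",
    }
    return urllib.request.Request(
        f"{API_BASE}/Account/{ACCOUNT_ID}/Call/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # auth headers omitted
        method="POST",
    )


request = build_call_request("+YourVobizNumber", "+CustomerNumber", "https://ngrok.url")
```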

Number Configuration

Ensure your Vobiz number is associated with an Application in the portal that points to your public URLs. If using make_call.py, the answer_url provided in the request will override the portal defaults for that specific call.

Troubleshooting & FAQ

The AI is talking over me / not stopping?

Check the utterance_end_ms value in the Deepgram configuration in agent.py. If it's too high, it takes longer to detect that you've started speaking. Also, ensure clearAudio is sent as soon as caller speech is detected, so that any agent audio still queued at Vobiz is dropped.

I see "401 Unauthorized" in the logs?

Ensure your NGROK_AUTH_TOKEN is set in your environment. pyngrok requires authentication for persistent tunnels and certain advanced features.

Why 20ms chunks?

20 ms is the conventional packetization interval for G.711 audio in VoIP, and at 8000 Hz it corresponds to 160 one-byte samples. Sending larger chunks can cause "jitter" or robotic-sounding audio, while smaller chunks create excessive network overhead for the Vobiz ingress nodes.
