Vobiz AI Voice Agent via WebSockets

This integration guide is for developers and engineers building custom AI voice agents directly on Vobiz. It covers the high-level architecture, the WebSocket event protocol, and the byte-level audio conversions, using the Python reference implementation throughout.

Architecture & Connectivity

The Split-Server Model

The reference implementation runs two servers concurrently:

  • FastAPI Server (Port 5000): Handles the HTTP webhooks (`/answer`, `/hangup`) and acts as a WebSocket proxy/gateway.
  • Agent Server (Port 5001): A dedicated websockets server that maintains call state and processes the audio stream.

Flow Overview

  1. The Vobiz Cloud receives a call to your Phone Number.
  2. An HTTP POST webhook is fired to your /answer route (e.g., via an ngrok tunnel).
  3. Your server returns an XML <Stream> response back to Vobiz.
  4. Vobiz initiates a WSS Upgrade Request to your specified WebSocket URL.
  5. Your FastAPI server accepts the WebSocket and proxies it to your local Agent Server, which establishes the session.
  6. A bidirectional audio stream then flows between Vobiz and your Agent Server through the proxy.

Vobiz XML Protocol

Vobiz uses a specialized XML structure to orchestrate calls. These responses must be returned by your /answer endpoint.

The Binary Audio Stream (Primary)

To initiate a bidirectional voice session, return the <Stream> tag.

XML
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <!-- 
      - bidirectional="true": Enables sound to flow both ways.
      - keepCallAlive="true": Prevents hangup when stream starts.
      - contentType: Defines audio format (G.711 mu-law is standard).
    -->
    <Stream 
        bidirectional="true" 
        keepCallAlive="true" 
        contentType="audio/x-mulaw;rate=8000" 
        statusCallbackUrl="https://your-domain.com/stream-status" 
        statusCallbackMethod="POST">
        wss://your-domain.com/ws
    </Stream>
</Response>
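
As a rough sketch, the same response can be generated in Python with the standard library instead of a hand-written string. The function name `build_stream_response` is hypothetical; the attribute names and values are taken from the XML above.

```python
import xml.etree.ElementTree as ET

def build_stream_response(ws_url, status_callback_url=None):
    """Return the <Response><Stream> document for the /answer webhook."""
    attrs = {
        "bidirectional": "true",
        "keepCallAlive": "true",
        "contentType": "audio/x-mulaw;rate=8000",
    }
    if status_callback_url:
        attrs["statusCallbackUrl"] = status_callback_url
        attrs["statusCallbackMethod"] = "POST"
    response = ET.Element("Response")
    stream = ET.SubElement(response, "Stream", attrs)
    stream.text = ws_url  # the WebSocket URL is the element's text content
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Building the document with ElementTree escapes special characters in the URLs (for example `&` in a query string), which a plain string template would not do.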

Handling Hangups

You can configure a global Hangup URL in your Vobiz Application settings or specify it per-call in the REST API.

  1. Log into the Vobiz Console.
  2. Navigate to Applications.
  3. Edit your voice application.
  4. Set the Hangup URL to https://<your-ngrok-url>/hangup.
  5. Set Hangup Method to POST.

WebSocket Event Documentation

Once the WebSocket handshake is successful, Vobiz and the Agent exchange JSON frames.

Inbound Events (From Vobiz to Agent)

start

Sent once at the beginning of the stream.

JSON
{
  "event": "start",
  "streamId": "s-uuid-123",
  "callId": "c-uuid-456",
  "mediaServer": "vobiz-node-05",
  "metadata": {
    "from": "+19295551212",
    "to": "+18005550199"
  }
}
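
One way to consume this frame is to capture the identifiers into a small session object. The sketch below assumes the field names shown above; `CallSession` and `parse_start` are hypothetical names, not part of agent.py.

```python
import json
from dataclasses import dataclass

@dataclass
class CallSession:
    stream_id: str
    call_id: str
    caller: str
    callee: str

def parse_start(frame: str) -> CallSession:
    """Build a session record from the JSON text of a start event."""
    data = json.loads(frame)
    if data.get("event") != "start":
        raise ValueError("expected a start event")
    metadata = data.get("metadata", {})
    return CallSession(
        stream_id=data["streamId"],
        call_id=data["callId"],
        caller=metadata.get("from", ""),
        callee=metadata.get("to", ""),
    )
```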

media

Sent every 20ms while the caller is speaking.

JSON
{
  "event": "media",
  "media": {
    "payload": "base64_encoded_8_bit_mulaw_samples",
    "track": "inbound",
    "chunkId": "105"
  }
}
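
A minimal sketch of unpacking a media frame follows. It assumes the payload is base64-encoded 8-bit mu-law at 8000 Hz, as configured in the `<Stream>` tag; `decode_media` is a hypothetical helper name.

```python
import base64
import json

SAMPLE_RATE = 8000  # one mu-law byte per sample, 8000 samples per second

def decode_media(frame: str):
    """Return (raw mu-law bytes, duration in milliseconds) for a media event."""
    data = json.loads(frame)
    audio = base64.b64decode(data["media"]["payload"])
    duration_ms = len(audio) * 1000 / SAMPLE_RATE
    return audio, duration_ms
```

With these parameters a 160-byte payload corresponds to 20 ms of audio.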

playedStream

Sent after the agent's audio reaches a checkpoint.

JSON
{
  "event": "playedStream",
  "streamId": "s-uuid-123",
  "name": "greeting_complete"
}

stop

Sent when the call ends or the stream is closed by Vobiz.

JSON
{
  "event": "stop",
  "streamId": "s-uuid-123"
}
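
The four inbound events can be routed with a small dispatcher. This is an illustrative sketch only: `InboundRouter` is a hypothetical class, and a real agent would forward media to speech recognition rather than just counting frames.

```python
import json

class InboundRouter:
    """Track stream state from the JSON frames Vobiz sends."""

    def __init__(self):
        self.stream_id = None
        self.open = False
        self.media_frames = 0
        self.played = []  # checkpoint names reported by playedStream

    def handle(self, frame: str) -> str:
        data = json.loads(frame)
        event = data.get("event")
        if event == "start":
            self.stream_id = data["streamId"]
            self.open = True
        elif event == "media":
            self.media_frames += 1
        elif event == "playedStream":
            self.played.append(data["name"])
        elif event == "stop":
            self.open = False
        else:
            return "ignored"  # unknown events are skipped, not treated as errors
        return event
```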

Outbound Events (From Agent to Vobiz)

playAudio

Commands Vobiz to play sound to the caller.

JSON
{
  "event": "playAudio",
  "media": {
    "contentType": "audio/x-mulaw",
    "sampleRate": 8000,
    "payload": "base64_mulaw_bytes"
  }
}
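
Building this frame in Python might look like the sketch below. `play_audio_frame` is a hypothetical helper; the field names and values mirror the JSON above.

```python
import base64
import json

def play_audio_frame(mulaw: bytes) -> str:
    """Wrap raw mu-law bytes in a playAudio JSON frame."""
    return json.dumps({
        "event": "playAudio",
        "media": {
            "contentType": "audio/x-mulaw",
            "sampleRate": 8000,
            "payload": base64.b64encode(mulaw).decode("ascii"),
        },
    })
```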

clearAudio

Immediately stops all pending audio in the Vobiz buffer. Crucial for Barge-in (interruption).

JSON
{
  "event": "clearAudio",
  "streamId": "s-uuid-123"
}
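
A barge-in handler typically has to do two things: discard audio the agent has not sent yet, and tell Vobiz to drop what it has already buffered. The sketch below illustrates that idea under the assumption that outbound frames wait in a local queue; `OutboundAudio` is a hypothetical class, not part of agent.py.

```python
import json
from collections import deque

class OutboundAudio:
    """Holds frames waiting to be sent and supports interruption."""

    def __init__(self, stream_id: str):
        self.stream_id = stream_id
        self.pending = deque()

    def enqueue(self, frame: str) -> None:
        self.pending.append(frame)

    def barge_in(self) -> str:
        """Drop unsent audio locally and return the clearAudio frame to send."""
        self.pending.clear()
        return json.dumps({"event": "clearAudio", "streamId": self.stream_id})
```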

checkpoint

Inserts a named marker in the outbound stream. Use it to track TTS playback progress: Vobiz reports reached markers through playedStream events.

JSON
{
  "event": "checkpoint",
  "streamId": "s-uuid-123",
  "name": "step_2_instructions"
}
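
One possible way to pair checkpoints with their playedStream confirmations is sketched below. `CheckpointTracker` is a hypothetical class, and matching on the `name` field is an assumption based on the event examples in this guide.

```python
import json

class CheckpointTracker:
    """Remember sent checkpoints until Vobiz confirms they were played."""

    def __init__(self, stream_id: str):
        self.stream_id = stream_id
        self.waiting = set()

    def mark(self, name: str) -> str:
        """Register a checkpoint and return the JSON frame to send."""
        self.waiting.add(name)
        return json.dumps({"event": "checkpoint", "streamId": self.stream_id, "name": name})

    def on_played(self, frame: str) -> bool:
        """Return True if the frame confirms a checkpoint we were waiting on."""
        data = json.loads(frame)
        if data.get("event") != "playedStream":
            return False
        name = data.get("name")
        if name in self.waiting:
            self.waiting.discard(name)
            return True
        return False
```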

Audio Engineering Deep-Dive

Telephony audio uses the G.711 standard. Modern AI speech engines output higher-fidelity PCM, which must be downsampled and companded before it reaches the phone network.

G.711 Mu-law (PCMU)

  • Sample Rate: 8000 Hz (8kHz)
  • Bit Depth: 8-bit
  • Compression: Logarithmic (companding)
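
To make the companding concrete, here is a sketch of the inverse direction: expanding inbound mu-law bytes to 16-bit linear PCM, which is useful if a downstream component expects PCM. It follows the widely used G.711 mu-law expansion with a bias of 0x84; the function names are hypothetical.

```python
import struct

BIAS = 0x84  # constant added before encoding and removed after decoding

def mulaw_to_pcm16(byte: int) -> int:
    """Expand one mu-law byte to a signed 16-bit linear sample."""
    byte = ~byte & 0xFF            # mu-law bytes are stored bit-inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07  # 3-bit segment number
    mantissa = byte & 0x0F         # 4-bit position within the segment
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample

def decode_mulaw(payload: bytes) -> bytes:
    """Convert a mu-law payload to little-endian 16-bit PCM bytes."""
    samples = [mulaw_to_pcm16(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Each input byte becomes two output bytes, so a 160-byte frame expands to 320 bytes of PCM. The byte 0xFF decodes to silence (0).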

The Conversion Pipeline in agent.py

  1. Synthesis: AI TTS tools often return 16-bit PCM at 24,000 Hz or similar.
  2. Downsampling: We use linear interpolation to drop from 24kHz to 8kHz.
  3. Companding: Each 16-bit linear sample is converted to an 8-bit mu-law byte using bitmasks and shift operations. The logarithmic curve gives quiet samples finer resolution than loud ones.
  4. Packetization: Audio is sent in 160-byte chunks, each representing exactly 20 milliseconds of speech (8,000 samples per second × 0.020 s × 1 byte per sample).
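
The pipeline above can be sketched end to end as follows. This is an illustration under stated assumptions, not the code in agent.py: the input is assumed to be little-endian 16-bit mono PCM, the resampler applies no anti-aliasing filter, the encoder is the common G.711 mu-law routine (bias 0x84, clip 32635), and padding the final frame with 0xFF (mu-law silence) is a choice made for this example.

```python
import struct

BIAS = 0x84
CLIP = 32635
FRAME_BYTES = 160  # 20 ms at 8000 Hz, one byte per sample

def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation between neighbouring samples."""
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out

def pcm16_to_mulaw(sample: int) -> int:
    """Compress one signed 16-bit sample to a mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):  # find the highest set bit
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def packetize(mulaw: bytes, frame_bytes: int = FRAME_BYTES):
    """Split into fixed-size frames, padding the last one with silence."""
    return [
        mulaw[start:start + frame_bytes].ljust(frame_bytes, b"\xff")
        for start in range(0, len(mulaw), frame_bytes)
    ]

def tts_to_frames(pcm: bytes, src_rate: int = 24000):
    """Convert TTS PCM bytes into 20 ms mu-law frames."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[:count * 2])
    narrow = resample_linear(samples, src_rate, 8000)
    return packetize(bytes(pcm16_to_mulaw(s) for s in narrow))
```

For example, 40 ms of 24 kHz audio (960 samples) yields two 160-byte frames. At an exact 3:1 ratio the interpolation reduces to keeping every third sample; production code would usually low-pass filter first to limit aliasing.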

Advanced Usage & Outbound Calls

Using make_call.py

The make_call.py script builds and sends the request to the Vobiz Account/{id}/Call/ endpoint that places an outbound call.

  1. Run python server.py in one terminal window.
  2. Open a new terminal.
  3. Run python make_call.py --to +DestNumber.

REST API Payload Example

When make_call.py runs, it sends the following JSON payload to Vobiz:

JSON
{
  "from": "+YourVobizNumber",
  "to": "+CustomerNumber",
  "answer_url": "https://ngrok.url/answer",
  "answer_method": "POST",
  "hangup_url": "https://ngrok.url/hangup",
  "hangup_method": "POST"
}
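
A payload like this could be assembled with a helper along the following lines. The function `build_call_payload` is hypothetical, and the leading "+" check is an assumption based on the number format used in the examples in this guide; the sketch deliberately stops short of sending the request.

```python
import json

def build_call_payload(from_number: str, to_number: str, public_base_url: str) -> dict:
    """Assemble the outbound-call payload from two numbers and a public URL."""
    for number in (from_number, to_number):
        if not number.startswith("+"):  # assumption: numbers carry a leading "+"
            raise ValueError(f"number must start with '+': {number!r}")
    base = public_base_url.rstrip("/")
    return {
        "from": from_number,
        "to": to_number,
        "answer_url": f"{base}/answer",
        "answer_method": "POST",
        "hangup_url": f"{base}/hangup",
        "hangup_method": "POST",
    }
```

Calling `json.dumps(build_call_payload(...))` then produces the request body.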

Number Configuration

Ensure your Vobiz number is associated with an Application in the portal that points to your public URLs. If using make_call.py, the answer_url provided in the request will override the portal defaults for that specific call.

Troubleshooting & FAQ

The AI is talking over me / not stopping?

Check the utterance_end_ms value in the Deepgram configuration in agent.py. This setting controls how long a gap in speech Deepgram waits for before reporting the end of an utterance; if it is too high, the agent reacts to your speech later. Also make sure clearAudio is sent as soon as caller speech is detected.

I see "401 Unauthorized" in the logs?

Ensure your NGROK_AUTH_TOKEN is set in your environment. pyngrok requires authentication for persistent tunnels and certain advanced features.

Why 20ms chunks?

Telephony networks conventionally frame G.711 audio in 20 ms packets. Larger chunks add latency and can make playback sound choppy or robotic, while smaller chunks create excessive per-message overhead for the Vobiz ingress nodes.