
VoiceXML Streaming over WebSockets

WebSocket audio streaming is the second major architecture for connecting phone calls to AI voice agents — and it is fundamentally different from SIP.

Instead of routing calls through a SIP signaling layer to a platform like LiveKit, you instruct Vobiz to answer the call directly and stream the raw audio over a standard WebSocket connection straight to your server. Your server is the AI pipeline.

This is the lowest-level, most direct integration path available. No third-party platform sits between Vobiz and your code. Every byte of audio is yours to process however you choose — which means maximum control, minimum cost, and full responsibility for everything that connects the pieces.

Key Insight

VoiceXML streaming is not a replacement for SIP — it is a different integration layer entirely. SIP handles call routing and control at the telephony layer. WebSocket streaming handles audio delivery at the application layer. You can use both in the same architecture (e.g., SIP to route the call to Vobiz, then VoiceXML streaming to pipe audio to your Python server).

How WebSocket Streaming Works

When an inbound call arrives at Vobiz, the platform needs to know what to do with it. With VoiceXML streaming, you configure a webhook URL that Vobiz fetches, which returns a VoiceXML document containing a <Connect><Stream> directive. This tells Vobiz: "Connect this call to my WebSocket server and start streaming audio."

The VoiceXML directive that starts WebSocket streaming
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://your-server.com/ws" />
  </Connect>
</Response>

Once Vobiz receives this directive, it establishes a WebSocket connection to your server and begins forwarding the caller's audio in real time. Simultaneously, any audio your server sends back over the same WebSocket is played to the caller. The connection is bidirectional and persistent for the entire duration of the call.
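The webhook response can be produced with nothing but the standard library. This is a sketch: the `build_stream_response` helper is illustrative, not part of any Vobiz SDK.

```python
from xml.sax.saxutils import quoteattr

def build_stream_response(ws_url: str) -> str:
    """Build the VoiceXML document that tells Vobiz to stream call audio to ws_url."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Response>\n"
        "  <Connect>\n"
        f"    <Stream url={quoteattr(ws_url)} />\n"   # quoteattr escapes the URL for XML
        "  </Connect>\n"
        "</Response>"
    )

print(build_stream_response("wss://your-server.com/ws"))
```

Your webhook handler would return this string with a `Content-Type: text/xml` header.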

WebSocket Audio Pipeline Flow
PSTN: Caller dials your Vobiz DID number
    ↓
Vobiz: Fetches your webhook URL → receives VoiceXML <Stream> directive
    ↓
Vobiz: Opens a WebSocket to wss://your-server.com/ws
       Sends "connected" + "start" JSON messages with call metadata
       Streams raw audio as "media" messages (base64 µ-law, 20ms chunks)
    ↓
Your Server (AI Pipeline): Receives audio → STT → LLM → TTS → sends audio back via WebSocket
    ↓
Vobiz: Plays returned audio to the caller in real time
    ↓
PSTN: Caller hears your AI agent's voice response

WebSocket Message Types

All messages over the WebSocket connection are JSON. There are five event types your server will receive, and one type you will send back:

connected (receive)

Sent immediately when the WebSocket connection is established. Contains the WebSocket protocol version. No call-specific data yet.

start (receive)

Sent once when the audio stream begins. Contains the StreamSid (unique stream identifier), CallSid, and any custom parameters you configured on the Stream directive. This is where you initialize your per-call state.

media (receive + send)

The audio data itself. Received continuously while the caller is speaking. Contains a base64-encoded payload of G.711 µ-law audio at 8kHz. To send audio to the caller, you send the same format back with the StreamSid.

dtmf (receive)

Sent when the caller presses a phone key (0–9, *, #). Contains the digit pressed. Important: DTMF tones are detected by Vobiz and delivered as discrete events — they do NOT appear in the audio stream. You must handle this separately.

stop (receive)

Sent when the call ends (caller hangs up, or the call is terminated programmatically). After receiving this, the WebSocket connection will close. Clean up all per-call resources here.
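The five receive events map naturally onto a small dispatch function. The sketch below keeps per-call state keyed by StreamSid; the JSON field names (`streamSid`, `media.payload`, `dtmf.digit`) follow the message shapes described above and should be verified against the Vobiz message reference.

```python
import base64
import json
from typing import Optional

def handle_message(raw: str, calls: dict) -> Optional[str]:
    """Dispatch one incoming WebSocket message; `calls` maps StreamSid to per-call state.

    Field names are illustrative; check the exact JSON schema in the provider docs.
    """
    msg = json.loads(raw)
    event = msg["event"]
    if event == "connected":
        return None                            # protocol handshake, no call data yet
    if event == "start":
        sid = msg["start"]["streamSid"]
        calls[sid] = {"audio": bytearray(), "digits": []}   # initialize per-call state
        return sid
    sid = msg.get("streamSid")
    if event == "media":
        calls[sid]["audio"] += base64.b64decode(msg["media"]["payload"])
    elif event == "dtmf":
        calls[sid]["digits"].append(msg["dtmf"]["digit"])   # keys never appear in audio
    elif event == "stop":
        calls.pop(sid, None)                   # call ended: release per-call resources
    return sid
```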

G.711 µ-Law: The Audio Encoding

Every audio payload arriving from Vobiz (and that you must send back) is encoded as G.711 µ-law (PCMU). This is not an arbitrary choice — it is the standard audio encoding of the global telephone network, and has been since the 1960s. Understanding what it is, and why your AI models cannot use it directly, is essential.

Sample Rate: 8,000 Hz

Telephone-quality (narrowband). The telephone passband covers roughly 300–3400 Hz: sufficient for intelligibility but poor for high-fidelity synthesis.

Bit Depth: 8 bits / sample

After companding (non-linear compression), one 8-bit µ-law sample covers roughly the dynamic range of 14-bit linear PCM.

Bitrate: 64 kbps

8,000 samples/sec × 8 bits = 64,000 bits/sec. Delivered in precise 20ms frames = 160 bytes per packet.

The µ (mu) in µ-law refers to a logarithmic companding function that applies non-linear compression to the audio signal before encoding. This gives more dynamic range resolution to quiet sounds and compresses loud sounds. It is not standard PCM — you cannot feed µ-law bytes directly to a speech recognition model.

Why this matters: When Vobiz sends you audio, it arrives base64-encoded in JSON. Before your STT model can process it, you must: (1) base64-decode it, (2) decode µ-law to linear 16-bit PCM, (3) upsample from 8kHz to 16kHz (or whatever rate your STT model expects). Before sending audio back, you must reverse the chain. Frameworks like Pipecat handle this automatically via serializers.
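Steps (1) and (2) can be done in pure Python. The sketch below implements the standard G.711 µ-law expansion formula; resampling (step 3) is best left to a DSP library.

```python
import base64

BIAS = 0x84  # 132, the G.711 mu-law bias

def ulaw_to_pcm16(ulaw_bytes: bytes) -> list:
    """Decode 8-bit G.711 mu-law samples to 16-bit linear PCM values."""
    out = []
    for b in ulaw_bytes:
        b = ~b & 0xFF                                 # mu-law is stored complemented
        exponent = (b & 0x70) >> 4
        mantissa = b & 0x0F
        magnitude = ((mantissa << 3) + BIAS) << exponent
        out.append(BIAS - magnitude if b & 0x80 else magnitude - BIAS)
    return out

# Step 1: base64-decode the "media" payload; step 2: expand mu-law to linear PCM.
payload = base64.b64encode(bytes([0xFF, 0x00])).decode()  # stand-in for a real payload
pcm = ulaw_to_pcm16(base64.b64decode(payload))
print(pcm)  # [0, -32124]: 0xFF is mu-law silence, 0x00 the largest negative sample
```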

The Audio Conversion Pipeline

Every WebSocket voice AI implementation involves this conversion chain on both the inbound and outbound paths. Understanding each step prevents the most common bugs: distorted audio, incorrect volume levels, and desynchronized playback.

Inbound

Caller → JSON → AI Pipeline

1. Receive JSON: a "media" event with a base64 payload
2. Base64 Decode: string → bytes (µ-law)
3. Decode µ-law: 8-bit µ-law → 16-bit PCM (8kHz)
4. Resample: 8kHz → 16kHz/24kHz PCM
5. Feed STT: stream the PCM to your STT engine

Outbound

AI Response → JSON → Caller

1. Generate Audio: the TTS engine returns 16kHz/24kHz PCM
2. Resample: 16kHz/24kHz → 8kHz PCM
3. Encode µ-law: 16-bit 8kHz PCM → 8-bit µ-law
4. Base64 Encode: µ-law bytes → base64 string
5. Send JSON: transmit {"event":"media"} over wss

Python's built-in audioop module historically handled µ-law conversion and resampling: audioop.ulaw2lin() performs the expansion and audioop.ratecv() converts sample rates. Note that audioop was deprecated in Python 3.11 and removed in 3.13 (the audioop-lts package restores it). Pipecat's serializers handle all of this for you behind the scenes.

Advantages

No Third-Party Platform Cost

There is no LiveKit, VAPI, or Retell layer. You pay Vobiz for the channel and the raw AI API costs (STT, LLM, TTS). At scale, this is a significant cost difference.

Direct AI Model Integration

Deepgram, OpenAI, and Cartesia natively consume or produce streaming audio over WebSockets, and LLM providers like Anthropic stream text tokens. Your Vobiz stream connects to them almost directly, with little intervening SDK abstraction.

Full Pipeline Control

Every byte of audio flows through your code. You can implement custom VAD logic, barge-in behavior, and call routing exactly as you need.

Easiest to Debug Locally

A WebSocket server is just a web server. Test locally with ngrok and inspect messages in browser DevTools. SIP debugging requires specialized tools like Wireshark.

Familiar Technology Stack

WebSockets are standard web technology. Any Python, Node.js, or Go developer can work with them without mastering SIP headers and RTP session setup.

No IP Allowlisting Required

Authentication happens at the connection level via URL parameters or headers, which is far simpler than IP-based ACLs that break whenever a provider's network changes.

Disadvantages

One Stateful Connection Per Call

You manage state yourself. 100 concurrent calls means 100 open sockets, each with its own context buffers. If your server crashes mid-call, those calls are lost.

Barge-In Requires Implementation

Getting interruptions right requires orchestrating VAD thresholds, aborting in-flight TTS playback, flushing audio buffers, and resetting conversation state, which is genuinely difficult to get right.

Turn Detection is Extremely Hard

Distinguishing mid-sentence pauses from genuine turn handovers typically requires a local ML model (e.g., Silero VAD) to avoid jarring false interruptions triggered by background noise.

Conversion Chain Complexity

The µ-law → PCM → AI → µ-law cycle is unforgiving: wrong endianness or a sample-rate mismatch produces badly distorted static with no clear error message.

TCP vs. UDP Protocol Conflicts

WebSockets run over TCP, which guarantees ordered, reliable packet delivery. If an audio packet is delayed, TCP retransmission holds up everything behind it, introducing jitter; UDP-based media transport (as SIP/RTP uses) simply skips the lost packet and moves on.

No Enterprise PBX Out-of-the-Box

Enterprise telephony systems (Avaya, Microsoft Teams) do not speak WebSocket audio. If PBX integration is a requirement, you need a SIP trunk architecture.

Pipecat Integration

Pipecat (open-source, from Daily.co) is the leading Python framework for building WebSocket-based voice pipelines. It provides abstractions that make streaming architectures production-viable, handling audio conversion, pipeline orchestration, VAD, and barge-in for you.

Pipeline Architecture

Pipecat models a voice call as a linear chain of processors. Each processor receives frames (audio, text, control markers) and hands outputs downstream. This maps naturally onto the stages of a voice conversation.

Conceptual Pipecat pipeline
transport.input()
↓ AudioRawFrame (µ-law from Vobiz)
stt # e.g. DeepgramSTTService
↓ TranscriptionFrame (text)
llm # e.g. OpenAILLMService
↓ TextFrame (response tokens)
tts # e.g. ElevenLabsTTSService
↓ AudioRawFrame (PCM)
transport.output()

Vobiz Serializer

The VobizFrameSerializer handles all base64 decoding, µ-law ↔ PCM conversion, and Vobiz-specific JSON message framing automatically — so your pipeline receives clean audio frames with no manual conversion code.


Built-In VAD

Integrates Silero VAD out of the box, running locally. It filters false positives and cancels in-flight TTS when the caller interrupts, producing natural conversational turn-taking.

Important Constraints

Pipecat's WebSocketServerTransport handles a single active connection at a time. Serving concurrent callers means running a separate pipeline instance per call, typically via independent worker processes or containers; one pipeline cannot multiplex two streams.

Direct Python + Vobiz

The alternative is handling the WebSocket yourself with a plain Python server (e.g., FastAPI or websockets). You take direct ownership of every encoding layer, which gives maximum flexibility at the cost of substantially more work.

Framework Heavy Lifting

  • µ-law ↔ PCM conversion
  • Base64 encoding and decoding
  • STT/LLM/TTS service wiring
  • Silero VAD configuration
  • Turn-detection state tracking

Bare-Metal Responsibilities

  • Asyncio connection and state management
  • Audio buffer byte handling
  • Latency management
  • Cancelling in-flight TTS on interruption
  • DTMF event handling

When Extreme Control Matters

  • 1

    Custom Pipelines

    Building experimental multimodal flows that existing open-source serializers and processors do not support.

  • 2

    Existing FastAPI Codebases

    Adding a native WebSocket endpoint to a large existing Python API without introducing a separate orchestration framework.

  • 3

    Millisecond-Level Audio Control

    Manually injecting IVR prompts or manipulating audio directly, without the added latency of an abstraction layer.

Key Implementation Factors

Development Complexity

Medium (Pipecat) / High (Bare Metal)

With Pipecat, a working voice agent can be built in hours. Bare-metal Python requires implementing the byte-level audio handling yourself.

Running Cost

Lowest Available

Your only costs are the raw AI API calls (STT, LLM, TTS) and Vobiz per-minute channel charges, with no platform fee on top. Highly attractive at scale.

Time to First Prototype

2–4 hours (Pipecat)

With a framework, you can validate the connectivity logic and have live call audio transcribing within an afternoon.

Latency

Near-Instant Connection + 20ms Audio Frames

No SIP handshake is needed before audio flows, so the stream starts quickly. Overall response latency is dominated by your STT, LLM, and TTS round trips.

Scaling

One Stateful Connection Per Call

Every active call holds server resources for its full duration, so high concurrency requires horizontal scaling, in contrast to architectures where a SIP platform absorbs the per-call state.

Coding Danger Zones

01

Sending Audio Too Fast

Pushing an entire TTS response into the socket at once overwhelms the platform's real-time pacing and can get messages rejected. Chunk outbound audio into 20ms frames and send them at real-time rate.
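A minimal pacing sketch: split the µ-law buffer into 160-byte (20ms) frames and send one per 20ms interval. `ws_send` stands in for your WebSocket library's send coroutine.

```python
import asyncio

FRAME_BYTES = 160  # 20 ms of 8 kHz, 8-bit mu-law audio

def frames(ulaw_audio: bytes):
    """Split a mu-law buffer into 20 ms frames, padding the tail with silence.

    0xFF is mu-law silence, so the padding does not click.
    """
    for i in range(0, len(ulaw_audio), FRAME_BYTES):
        chunk = ulaw_audio[i:i + FRAME_BYTES]
        yield chunk + b"\xff" * (FRAME_BYTES - len(chunk))

async def send_paced(ws_send, ulaw_audio: bytes):
    """Send one frame every 20 ms instead of dumping the whole buffer at once."""
    for frame in frames(ulaw_audio):
        await ws_send(frame)
        await asyncio.sleep(0.02)   # real-time pacing
```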

02

Endianness Mismatches

16-bit PCM is typically little-endian, but some APIs expect big-endian. A byte-order mismatch produces loud static with no error message, making it confusing to debug.
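A quick demonstration of the failure mode: the same two bytes decode to a completely different sample when read with the wrong byte order, which across a whole buffer sounds like static.

```python
import struct

sample = -1234                       # one 16-bit PCM sample
le = struct.pack("<h", sample)       # little-endian wire bytes (the common PCM layout)

# Reading those bytes with the wrong byte order yields a different value:
wrong = struct.unpack(">h", le)[0]
print(sample, wrong)  # -1234 12027
```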

03

Per-Call Resource Leaks

Failing to close asynchronous STT streams and other per-call resources when the stop event arrives (or the WebSocket closes) leads to gradual memory exhaustion under sustained load.

04

Concurrent Socket Writes

Multiple coroutines writing to the same WebSocket at once can interleave frames and trigger fatal socket errors. Serialize all outbound writes through a single sender task or lock.
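One common fix, assuming an asyncio-based server, is to funnel every outbound write through a single lock. The `SerializedSender` wrapper below is illustrative, not from any particular library.

```python
import asyncio

class SerializedSender:
    """Funnel all outbound WebSocket writes through one lock."""

    def __init__(self, send_coro):
        self._send = send_coro        # e.g. the send coroutine of your WS library
        self._lock = asyncio.Lock()

    async def send(self, message: str):
        async with self._lock:        # only one coroutine writes at a time
            await self._send(message)

async def main():
    sent = []
    async def raw_send(msg):
        sent.append(msg)
    sender = SerializedSender(raw_send)
    # Media playback and a control message may fire concurrently:
    await asyncio.gather(sender.send("media-frame"), sender.send("clear"))
    return sent

print(asyncio.run(main()))
```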

05

VAD False Triggers

Naive loudness-based voice detection fails against televisions or sirens in the background, repeatedly cancelling your agent's outgoing speech. Use a model-based VAD such as Silero.

06

DTMF Handling Mistakes

DTMF digits arrive as discrete dtmf events, not in the audio stream. Injecting raw keypad tones into the audio yourself confuses transcription; handle key presses via the event messages only.

Final Verdict

Strict Cost Objectives

Eliminates third-party platform fees, maximizing margin as you scale.

Bespoke AI Architecture

Gives you full ownership of the pipeline and precise control over conversation flow.

Pipecat-Centric Builds

A natural fit: Pipecat's transports target WebSocket streaming directly.

Rapid Prototyping

A Python server plus ngrok gets a working AI demo running in a day.

No Call Transfers Needed

A good fit when each call is a single self-contained interaction and you can forgo SIP's built-in live-transfer capabilities.

Web-Stack Teams

Lets standard Python/JS web teams build voice agents without learning telecom internals.

Need seamless PBX interaction?

If enterprise transfer flows, LiveKit routing, or a managed platform like VAPI are hard requirements, investigate the SIP options instead.

Compare SIP Matrix