VoiceXML Streaming over WebSockets
WebSocket audio streaming is the second major architecture for connecting phone calls to AI voice agents — and it is fundamentally different from SIP.
Instead of routing calls through a SIP signaling layer to a platform like LiveKit, you instruct Vobiz to answer the call directly and stream the raw audio over a standard WebSocket connection straight to your server. Your server is the AI pipeline.
This is the lowest-level, most direct integration path available. No third-party platform sits between Vobiz and your code. Every byte of audio is yours to process however you choose — which means maximum control, minimum cost, and full responsibility for everything that connects the pieces.
Key Insight
VoiceXML streaming is not a replacement for SIP — it is a different integration layer entirely. SIP handles call routing and control at the telephony layer. WebSocket streaming handles audio delivery at the application layer. You can use both in the same architecture (e.g., SIP to route the call to Vobiz, then VoiceXML streaming to pipe audio to your Python server).
How WebSocket Streaming Works
When an inbound call arrives at Vobiz, the platform needs to know what to do with it. With VoiceXML streaming, you configure a webhook URL that Vobiz fetches, which returns a VoiceXML document containing a <Connect><Stream> directive. This tells Vobiz: "Connect this call to my WebSocket server and start streaming audio."
Once Vobiz receives this directive, it establishes a WebSocket connection to your server and begins forwarding the caller's audio in real time. Simultaneously, any audio your server sends back over the same WebSocket is played to the caller. The connection is bidirectional and persistent for the entire duration of the call.
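As a concrete sketch, a webhook handler might build that VoiceXML response as follows. The element names and attributes here (<Response>, <Connect>, <Stream url="...">) are assumptions modeled on the directive described above, not a confirmed Vobiz schema; check the platform's VoiceXML reference for the real shape.

```python
# Hypothetical webhook response builder. The XML element names are
# assumptions based on the <Connect><Stream> directive described in
# this section, not a confirmed Vobiz schema.
from xml.sax.saxutils import quoteattr

def stream_response(ws_url: str) -> str:
    """Build a VoiceXML document telling Vobiz to stream this call's
    audio to the given WebSocket URL."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Response>\n"
        "  <Connect>\n"
        f"    <Stream url={quoteattr(ws_url)} />\n"
        "  </Connect>\n"
        "</Response>"
    )

print(stream_response("wss://example.com/media"))
```

Your webhook endpoint would typically return this string with an XML content type; Vobiz then opens the WebSocket connection to the URL in the Stream directive.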
WebSocket Message Types
All messages over the WebSocket connection are JSON. There are five event types your server will receive, and one type you will send back:

connected: Sent immediately when the WebSocket connection is established. Contains the WebSocket protocol version. No call-specific data yet.

start: Sent once when the audio stream begins. Contains the StreamSid (unique stream identifier), the CallSid, and any custom parameters you configured on the Stream directive. This is where you initialize your per-call state.

media: The audio data itself, received continuously while the caller is speaking. Contains a base64-encoded payload of G.711 µ-law audio at 8 kHz. This is also the one type you send back: to play audio to the caller, send the same format with the StreamSid.

dtmf: Sent when the caller presses a phone key (0–9, *, #). Contains the digit pressed. Important: DTMF tones are detected by Vobiz and delivered as discrete events — they do NOT appear in the audio stream. You must handle this separately.

stop: Sent when the call ends (the caller hangs up, or the call is terminated programmatically). After receiving this, the WebSocket connection will close. Clean up all per-call resources here.
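A minimal server-side dispatcher for these messages might look like the sketch below. The event names and field layout (event, streamSid, media.payload, dtmf.digit) are assumptions inferred from the descriptions above and common streaming-telephony protocols; adjust them to the actual Vobiz message schema.

```python
# Minimal event dispatcher for the five inbound message types. Field
# names are assumptions inferred from this section's descriptions,
# not a confirmed Vobiz schema.
import base64
import json

calls = {}  # per-call state, keyed by stream SID

def handle_message(raw: str):
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "connected":
        return ("connected", msg.get("protocol"))
    if event == "start":
        sid = msg["start"]["streamSid"]
        calls[sid] = {"audio": bytearray(), "digits": []}  # init per-call state
        return ("start", sid)
    if event == "media":
        sid = msg["streamSid"]
        calls[sid]["audio"] += base64.b64decode(msg["media"]["payload"])
        return ("media", sid)
    if event == "dtmf":
        sid = msg["streamSid"]
        calls[sid]["digits"].append(msg["dtmf"]["digit"])  # discrete event, not audio
        return ("dtmf", sid)
    if event == "stop":
        calls.pop(msg["streamSid"], None)  # clean up per-call resources
        return ("stop", msg["streamSid"])
    return ("unknown", None)
```

In a real server this function would be called once per incoming WebSocket frame, with the media branch feeding decoded audio into your STT pipeline.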
G.711 µ-Law: The Audio Encoding
Every audio payload arriving from Vobiz (and that you must send back) is encoded as G.711 µ-law (PCMU). This is not an arbitrary choice — it is the standard audio encoding of the global telephone network, and has been since the 1960s. Understanding what it is, and why your AI models cannot use it directly, is essential.
Sample rate: 8 kHz, telephone-quality (narrowband). Human voice is 300–3400 Hz. Sufficient for intelligibility but poor for high-fidelity synthesis.
Bit depth: 8 bits per sample after companding (non-linear compression). Equivalent to ~12 bits of linear PCM in perceived dynamic range.
Bitrate: 8,000 samples/sec × 8 bits = 64,000 bits/sec, delivered in precise 20ms frames of 160 bytes each.
The µ (mu) in µ-law refers to a logarithmic companding function that applies non-linear compression to the audio signal before encoding. This gives more dynamic range resolution to quiet sounds and compresses loud sounds. It is not standard PCM — you cannot feed µ-law bytes directly to a speech recognition model.
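To make the companding concrete, here is a single-byte µ-law decode in pure Python, following the standard G.711 reference algorithm (bias 0x84, bit-inverted storage):

```python
def ulaw_to_linear(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear PCM sample."""
    u = ~u & 0xFF                              # mu-law bytes are stored bit-inverted
    mantissa = u & 0x0F
    exponent = (u & 0x70) >> 4
    t = ((mantissa << 3) + 0x84) << exponent   # expand the log-compressed value
    return (0x84 - t) if (u & 0x80) else (t - 0x84)

print(ulaw_to_linear(0xFF))  # 0      (digital silence)
print(ulaw_to_linear(0x80))  # 32124  (loudest positive sample)
```

Note the asymmetric range (±32124 rather than ±32767): the bias in the companding curve slightly narrows the decoded range, which is exactly why µ-law bytes cannot be treated as plain PCM.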
The Audio Conversion Pipeline
Every WebSocket voice AI implementation involves this conversion chain on both the inbound and outbound paths. Understanding each step prevents the most common bugs: distorted audio, incorrect volume levels, and desynchronized playback.
Inbound: Caller → G.711 µ-law (base64 in JSON) → decode → 16-bit linear PCM → resample → AI Pipeline (STT)
Outbound: AI Response (TTS) → resample to 8 kHz → encode to µ-law → base64 in JSON → Caller
Python's built-in audioop module handles µ-law conversion and resampling: audioop.ulaw2lin() performs the companding conversion and audioop.ratecv() performs sample rate conversion. Note that audioop was removed from the standard library in Python 3.13; a drop-in replacement is available on PyPI (audioop-lts). Pipecat's serializers handle all of this for you behind the scenes.
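As an illustration of the resampling step only, here is a naive 8 kHz → 16 kHz upsampler using linear interpolation. This is a sketch: proper resampling applies low-pass filtering, as audioop.ratecv or a DSP library would.

```python
# Naive 8 kHz -> 16 kHz upsampling by linear interpolation. Illustrative
# only: production code should use audioop.ratecv (audioop-lts on
# Python 3.13+) or a DSP library with proper filtering.
def upsample_2x(samples: list[int]) -> list[int]:
    """Double the sample rate by inserting the midpoint between neighbours."""
    out = []
    for i, s in enumerate(samples):
        out.append(s)
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append((s + nxt) // 2)  # interpolated sample between s and its neighbour
    return out

print(upsample_2x([0, 100, -100]))  # [0, 50, 100, 0, -100, -100]
```

The outbound path runs the inverse: downsample the model's 16 kHz (or higher) output back to 8 kHz before µ-law encoding.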
Advantages
No Third-Party Platform Cost
There is no LiveKit, VAPI, or Retell layer. You pay Vobiz for the channel and the raw AI API costs (STT, LLM, TTS). At scale, this is a significant cost difference.
Direct AI Model Integration
Deepgram, OpenAI, Anthropic, and Cartesia all natively consume and produce streaming audio over WebSockets. Your Vobiz stream connects to them almost directly, without heavy intervening SDK abstractions.
Full Pipeline Control
Every byte of audio flows through your code. You can implement custom VAD logic, custom barge-in behavior, and bespoke call-routing logic directly.
Easiest to Debug Locally
A WebSocket server is just a web server. Test locally with ngrok and inspect messages in browser DevTools. SIP debugging requires specialized tools like Wireshark.
Familiar Technology Stack
WebSockets are a standard web technology. Any Python, Node.js, or Go developer can work with them without first mastering SIP headers and RTP session setup.
No IP Allowlisting Required
Authentication happens at the connection level via URL parameters or headers, which is far simpler than IP-based ACLs that break whenever the provider's network changes.
Disadvantages
One Stateful Connection Per Call
You manage state manually. 100 concurrent calls means 100 open sockets, each holding conversation context in memory. If your server crashes mid-call, that call is lost.
Barge-In Requires Implementation
Getting interruptions right requires orchestrating VAD thresholds, aborting in-flight TTS playback, flushing audio buffers, and resetting conversation state. It is genuinely difficult to get right.
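A stripped-down sketch of that orchestration, with hypothetical class and method names, might look like this: the moment VAD reports caller speech, cancel the playback task and flush any queued audio.

```python
# Barge-in sketch with hypothetical names. Models only the core moves:
# cancel in-flight playback and flush unplayed TTS frames on caller speech.
import asyncio

class BargeIn:
    def __init__(self):
        self.playback = None      # asyncio.Task for in-flight TTS playback
        self.outbound = []        # queued, unplayed TTS frames

    def start_playback(self, coro):
        self.playback = asyncio.ensure_future(coro)

    def on_caller_speech(self):
        """Called by VAD the moment the caller starts talking."""
        if self.playback and not self.playback.done():
            self.playback.cancel()  # abort in-flight TTS
        self.outbound.clear()       # flush unplayed audio

async def demo():
    b = BargeIn()
    b.outbound = [b"frame"] * 10
    b.start_playback(asyncio.sleep(60))  # stand-in for a long TTS playback
    await asyncio.sleep(0)               # let playback start
    b.on_caller_speech()                 # caller interrupts
    try:
        await b.playback
    except asyncio.CancelledError:
        pass
    return b.playback.cancelled(), len(b.outbound)

print(asyncio.run(demo()))  # (True, 0)
```

A production version also has to reset the conversation turn state and tell the TTS service to stop generating, which is where most of the real complexity lives.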
Turn Detection is Extremely Hard
Distinguishing a mid-sentence pause from a true end of turn requires a local machine-learning VAD (such as Silero) to avoid jarring false interruptions triggered by background noise.
Conversion Chain Complexity
The µ-law → PCM → AI → PCM → µ-law cycle is unforgiving about byte formats. Wrong endianness or a sample-rate mismatch produces severely distorted static, with no error log to point at the cause.
TCP vs. UDP Protocol Conflicts
WebSockets run over TCP, which guarantees packet delivery. If an audio packet is delayed, TCP retransmission holds up every packet behind it, injecting jitter; UDP-based media transport (as used by SIP/RTP) simply skips the late packet and moves on.
No Enterprise PBX Out-of-the-Box
Enterprise PBX infrastructure (Avaya, Microsoft Teams) does not speak WebSocket streams. If integrating with such systems is a hard requirement, you need a SIP trunk architecture.
Pipecat Integration
Pipecat (open source, from Daily.co) is the leading Python framework for building WebSocket-based voice pipelines. It provides abstractions that make streaming architectures production-viable, handling audio conversion, pipeline orchestration, VAD, and barge-in for you.
Pipeline Architecture
Pipecat models a voice call as a linear chain of processors. Each processor receives frames (audio, text, control markers) and passes its output downstream. This maps naturally onto how a voice conversation flows.
Vobiz Serializer
The VobizFrameSerializer handles all base64 decoding, µ-law ↔ PCM conversion, and Vobiz-specific JSON message framing automatically — so your pipeline receives clean audio frames with no manual conversion code.
Built-In VAD
Pipecat integrates Silero VAD out of the box, running locally. It detects when the caller starts speaking, filters out false positives, and triggers TTS cancellation immediately, producing a natural conversational cadence.
Important Constraints
Pipecat's WebSocketServerTransport handles one active connection per process. Serving concurrent callers means running multiple Pipecat workers, typically one process per call, managed via a process pool or container orchestration. A single process cannot serve two calls through one transport.
Direct Python + Vobiz
The alternative is writing the WebSocket server yourself on a raw ASGI server. You take direct ownership of every encoding and orchestration layer, which maximizes customizability at the cost of a much steeper difficulty curve.
Framework Heavy Lifting
- µ-law ↔ PCM conversion
- Base64 decoding of media payloads
- STT/LLM/TTS synchronization
- Silero VAD threshold tuning
- Turn detection and conversation state
Bare-Metal Responsibilities
- Asyncio connection and state management
- Buffer and byte-format arithmetic
- Latency compensation
- Manually aborting in-flight TTS
- DTMF event handling
When Extreme Control Matters
1. Hyper-Bespoke Pipelines
Building experimental or multimodal pipelines that no off-the-shelf serializer supports.
2. Existing FastAPI Integration
Adding a native WebSocket endpoint to a large existing Python monolith without introducing a parallel orchestration framework.
3. Millisecond-Level Audio Manipulation
Surgically injecting specific IVR prompts or manipulating audio without the added latency of a pipeline abstraction.
Key Implementation Factors
Development Complexity
Medium (Pipecat) / High (bare metal). With Pipecat, a working voice agent takes hours. Bare-metal Python means implementing the byte-level audio handling yourself.
Cost Structure
Minimal. Your operating costs are the AI API calls (STT, LLM, TTS) plus Vobiz per-minute channel charges, and nothing else. Highly attractive at scale.
Time to First Prototype
2–4 hours (Pipecat). The framework lets you validate connectivity quickly, with live audio transcribing end to end within a working afternoon.
Latency
Near-instant connection + 20ms audio frames. With no SIP handshake, a WebSocket call goes live quickly; total latency depends mostly on AI API response times.
Scaling Model
One stateful connection per call. Every active call holds server resources for its full duration, which makes horizontal scaling more demanding than in typical SIP-platform architectures.
Coding Danger Zones
Sending Audio Too Fast
Pushing an entire processed TTS response into the socket at once, instead of pacing it as 20ms frames, can overwhelm the platform's buffer and cause audio to be dropped or rejected.
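A sketch of the fix: chunk TTS output into 160-byte frames (20 ms of 8 kHz µ-law, per the math earlier in this section) and pace the sends. The ws_send callable here is a stand-in for your actual socket send.

```python
# Pacing sketch: send one 20 ms frame at a time instead of one burst.
# ws_send is a stand-in for the real WebSocket send function.
import asyncio

FRAME_BYTES = 160  # 8000 samples/s * 0.020 s * 1 byte/sample (mu-law)

def frames(audio: bytes, size: int = FRAME_BYTES):
    """Split an audio clip into fixed-size frames (the last may be short)."""
    return [audio[i:i + size] for i in range(0, len(audio), size)]

async def send_paced(ws_send, audio: bytes):
    """Send one 20 ms frame per iteration, paced in real time."""
    for frame in frames(audio):
        await ws_send(frame)
        await asyncio.sleep(0.020)  # pace to roughly real-time playback

print([len(f) for f in frames(b"\x00" * 400)])  # [160, 160, 80]
```

Some platforms buffer outbound audio themselves, but pacing keeps you safe and keeps barge-in responsive: there is less queued audio to flush when the caller interrupts.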
Endianness Mistakes
Python tooling generally assumes little-endian 16-bit PCM. If a model endpoint expects big-endian (or vice versa), the result is raw static with no error message, which makes it confusing to debug.
Memory Leaks on Disconnect
Failing to tear down async STT streams and per-call tasks when the stop event arrives leads to memory growth with every call and, eventually, process crashes.
Concurrent Writes to One Socket
Two coroutines writing to the same WebSocket simultaneously can interleave frames and corrupt the stream or raise socket errors. Serialize all sends through a single writer.
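One common remedy is to funnel every send through a single asyncio.Lock (or a dedicated writer task consuming a queue). A minimal sketch, with raw_send standing in for the real socket send:

```python
# Serializing socket writes with a lock. raw_send is a stand-in for
# the real WebSocket send function.
import asyncio

class SafeSender:
    """Wraps a raw send callable so only one coroutine writes at a time."""
    def __init__(self, raw_send):
        self._raw_send = raw_send
        self._lock = asyncio.Lock()

    async def send(self, data: bytes):
        async with self._lock:  # one writer at a time
            await self._raw_send(data)

async def demo():
    sent = []
    async def raw_send(data):
        await asyncio.sleep(0)  # simulate socket I/O yielding control
        sent.append(data)
    sender = SafeSender(raw_send)
    # Five coroutines sending concurrently; the lock serializes them.
    await asyncio.gather(*(sender.send(bytes([i])) for i in range(5)))
    return sent

print(asyncio.run(demo()))
```

The queue-plus-writer-task variant scales better when many producers (TTS, prompts, keep-alives) feed the same call, since producers never block on the socket.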
VAD False Positives
Simple loudness-based VAD triggers on televisions, sirens, and other background noise, repeatedly cancelling your agent's outgoing responses. Use a model-based VAD such as Silero.
DTMF Handling Mistakes
Injecting raw DTMF tone audio into the stream corrupts transcription, and inbound key presses never appear in the audio at all. Handle DTMF exclusively through the discrete events.
Final Verdict
Strict Cost Objectives
No platform markup: you pay only for AI APIs and Vobiz minutes, which preserves margin as you scale.
Bespoke AI Architecture
You own the pipeline end to end and can design stateful conversation flows exactly as you want them.
Pipecat-Centric Builds
A perfect match: Pipecat's WebSocket transport targets exactly this architecture.
Rapid Prototyping
A Python server plus ngrok gets a fully functional AI demo running within a day.
No Call Transfers Needed
You give up SIP's built-in live-transfer capabilities, which is fine when every call is a single, self-contained interaction.
Web-Native Teams
Standard web-stack Python/JS skills are enough; no one has to master obscure telecom protocols.
Need seamless PBX integration?
If enterprise transfer flows, LiveKit-based routing, or VAPI-style platform features are hard requirements, investigate the SIP-based options instead.