
SIP Trunking vs WebSocket Streaming

These are the two fundamental architectures for connecting phone calls to AI voice agents. The choice is not about which is "better" — it depends entirely on what you are building, your platform choices, and your priorities between deployment speed and total ownership.

SIP Trunking is the telephony industry standard — robust, enterprise-ready, and mandatory when you need PBX integration or managed platforms like LiveKit, VAPI, and Retell AI.

WebSocket Streaming is the developer-native path — highly cost-effective at scale, more direct, and ideal for custom AI pipelines built with Pipecat or bare-metal code.

Architectural Comparison

The two approaches operate at completely different layers of the stack. SIP is a telephony-layer protocol. WebSocket streaming is an application-layer transport. Here is what the full call path looks like end-to-end in each architecture:

SIP Trunking Path

1. PSTN: Caller dials phone number
2. SIP Trunk (Vobiz): PSTN → SIP INVITE routed to endpoint URI
3. Platform SIP Endpoint: LiveKit / VAPI / Retell terminates SIP, creates room
4. RTP Audio: UDP audio stream directly to platform
5. AI Agent: Receives WebRTC audio, runs STT → LLM → TTS
6. RTP Audio (back): TTS audio back to caller via UDP

Key: SIP signaling and RTP media are separate protocols. A managed platform forms the bridge.

WebSocket Streaming Path

1. PSTN: Caller dials phone number
2. Vobiz Webhook: Vobiz fetches your webhook and receives a VoiceXML stream directive
3. WebSocket (wss://): Direct TCP connection established to your server
4. Your Server: Receives base64 µ-law audio directly from Vobiz
5. AI Pipeline: STT → LLM → TTS logic driven entirely by your code
6. WebSocket (back): µ-law voice audio sent back over the same socket connection

Key: No separate managed platform layer. Your own server acts as the direct bridge via VoiceXML.
Important architectural nuance
These two architectures are not mutually exclusive. You can use a generic SIP trunk provider to route a call to Vobiz, and then use a Vobiz VoiceXML <Stream> directive to pipe that same call to your custom WebSocket server. SIP handles the initial routing; WebSocket handles the audio layer.
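To make the hybrid concrete: when Vobiz fetches your webhook, you answer with an XML document containing the stream directive. A minimal sketch of building that answer is below — the `<Response>`/`<Stream>` element names are illustrative assumptions, so check the Vobiz VoiceXML reference for the exact schema and attributes:

```python
def stream_response(ws_url: str) -> str:
    """Build the XML a webhook would return to bridge a SIP-routed call
    onto a WebSocket server. Element names are illustrative -- verify
    against the Vobiz VoiceXML docs."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Response>\n"
        f"  <Stream>{ws_url}</Stream>\n"
        "</Response>"
    )

print(stream_response("wss://agent.example.com/media"))
```

Your webhook handler (Flask, FastAPI, anything that serves HTTP) simply returns this string with an XML content type.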

Full Decision Matrix

| Evaluation Factor | SIP Trunking | WebSocket Streaming |
| --- | --- | --- |
| Developer profile | Telephony admins, DevOps, or teams comfortable with SIP/RTP concepts | Python / Node.js engineers — WebSocket + async patterns are familiar |
| Setup complexity | High — trunk config, IP ACLs, SIP URI routing, codec negotiation, firewall rules | Low/Medium — VoiceXML webhook + WebSocket server handling JSON |
| Call setup latency | 1–5 seconds (SIP INVITE handshake + PSTN routing overhead) | Near-instant (WebSocket TCP handshake + Vobiz webhook fetch) |
| Audio transport latency | Lower — UDP/RTP has no retransmission; dropped packets are skipped, preserving real-time flow | Slightly higher — TCP guarantees delivery; retransmitted packets can add jitter on poor networks |
| Audio quality support | G.711 8kHz or G.722 16kHz (HD wideband, if the chosen carrier supports it) | G.711 µ-law 8kHz (PSTN floor, same as standard SIP) |
| Infrastructure cost | Vobiz trunk rate + AI platform fee (LiveKit/VAPI/Retell markup) | Vobiz channel rate + raw AI API costs only; no platform markup |
| Live call transfer | Supported — blind and warm transfer via SIP REFER | Supported — Vobiz handles call transfer on the WebSocket path as well |
| Enterprise PBX integration | Native — Avaya, Cisco UCM, Teams Direct Routing demand SIP | Not applicable — no standard bridge to existing PBX infra |
| Turn-taking / interruption | Abstracted — handled completely by the managed platform | Manual — you must build VAD + async pipeline cancellation |
| Horizontal scaling | Carrier-layer — add trunk channels without touching server infra | Process-layer — you must scale WebSocket workers/containers |

Platform Compatibility Matrix

Which integration path each major AI voice platform natively expects, and what role they play in the overall architecture.

| Platform | SIP Trunking | WebSocket Streaming | Role in Architecture |
| --- | --- | --- | --- |
| LiveKit | Primary | N/A | Complete AI voice platform. SIP trunk terminates into the LiveKit SIP Service; the AI agent runs as a LiveKit participant. |
| VAPI | Primary | N/A | Managed AI voice platform. BYO SIP trunk or direct SIP URI; PSTN calls route exclusively through SIP trunking. |
| Retell AI | Primary | N/A | Managed AI voice platform. Elastic SIP trunk or Register Phone Call API (SIP URI dialing). |
| ElevenLabs | Primary | N/A | Conversational AI platform with native SIP integration. Connects directly to PSTN phone calls via SIP trunking; also provides TTS/STT/voice cloning services for use in other pipelines. |
| Pipecat | N/A | Primary | Open-source Python pipeline framework. Designed exclusively around WebSocket transport (Twilio, Telnyx, Plivo serializers); no native SIP support. |
| Direct Python (Vobiz) | N/A | Primary | Bare-metal WebSocket handler against the Vobiz streaming API. Maximum control, maximum ownership. |
| Bolna | Supported | Primary | Managed voice AI orchestration layer. Integrates via Twilio WebSocket streams or via SIP trunk configuration. |
| Ultravox | Supported | Primary | Real-time AI voice platform. Primary integration via WebSocket audio; SIP via intermediary transport. |

Cost Analysis

Cost structure is the most misunderstood difference between these two approaches. The Vobiz channel rate is identical for both paths — the difference comes from whether you add a managed AI platform layer on top (SIP path) or own the pipeline yourself (WebSocket path). All pricing below is in INR.

SIP Trunking Cost Stack

- Vobiz SIP channel (₹0.45/min): 45 paise per minute, inbound and outbound. Same rate for both directions. No hidden fees.
- Phone number / DID (₹500/month): per active Vobiz number; includes inbound PSTN routing to your SIP endpoint.
- Managed AI platform (their pricing): LiveKit, VAPI, Retell, and ElevenLabs each charge their own per-minute or subscription rate on top. This is the main cost driver at scale.
- STT, e.g. Deepgram (included or API): VAPI/Retell include STT in their platform pricing; LiveKit agents require your own STT API key.
- LLM, e.g. GPT-4o (API key required): platforms typically pass through OpenAI/Anthropic costs or charge a bundled per-minute rate.
- TTS, e.g. ElevenLabs (API key required): voice synthesis is billed per character or per minute depending on the provider.
Vobiz base cost: ₹0.45/min + ₹500/month per number.
Total cost = Vobiz rate + AI platform fees + STT/LLM/TTS API costs.

WebSocket Streaming Cost Stack

- Vobiz channel rate (₹0.65/min): 65 paise per minute, inbound and outbound; the WebSocket streaming rate on Vobiz.
- Phone number / DID (₹500/month): same ₹500/month per number as SIP. No difference.
- No managed AI platform (₹0): you build the pipeline yourself. No platform layer, no platform markup; this is the key saving.
- STT, e.g. Deepgram (direct API rate): you pay the STT provider directly. Deepgram Nova-2 streaming is one of the cheapest options available.
- LLM, e.g. GPT-4o-mini (direct API rate): pay OpenAI, Anthropic, or Google directly per token, with no platform intermediary taking a cut.
- TTS, e.g. Cartesia / ElevenLabs (direct API rate): you choose the TTS provider and pay them directly. Cartesia is significantly cheaper than ElevenLabs at volume.
- Server compute (cloud infra): one process per concurrent call. Budget for your worker pool infrastructure (VMs, containers, serverless).
Vobiz WebSocket rate: ₹0.65/min + ₹500/month per number.
Total cost = Vobiz rate + direct AI API costs only. No platform markup.
The Bottom Line
Vobiz SIP costs ₹0.45/min and WebSocket costs ₹0.65/min — both with ₹500/month per phone number. The real cost difference is the platform layer. With SIP + a managed platform you pay their per-minute or subscription fee on top of Vobiz. With WebSocket streaming you pay your AI APIs directly and cut out the middleman. Under 50,000 calls/month, the platform premium is often worth the saved engineering time. Above 50,000 calls/month, owning the pipeline pays off significantly.
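The crossover logic above can be sketched as a back-of-envelope calculation. The Vobiz rates come from the cost stacks above; the platform fee and direct AI API cost are placeholder assumptions — substitute your platform's and providers' real per-minute rates:

```python
VOBIZ_SIP_RATE = 0.45    # ₹/min (SIP cost stack above)
VOBIZ_WS_RATE = 0.65     # ₹/min (WebSocket streaming rate above)
PLATFORM_FEE = 4.00      # ₹/min -- placeholder; use your platform's real rate
AI_API_COST = 1.50       # ₹/min -- placeholder for direct STT/LLM/TTS spend
AVG_CALL_MINUTES = 3     # assumed average call length

def monthly_cost_sip(calls: int) -> float:
    # SIP path: Vobiz rate + managed platform fee (AI usually bundled)
    return calls * AVG_CALL_MINUTES * (VOBIZ_SIP_RATE + PLATFORM_FEE)

def monthly_cost_ws(calls: int) -> float:
    # WebSocket path: Vobiz rate + direct AI API costs, no platform markup
    return calls * AVG_CALL_MINUTES * (VOBIZ_WS_RATE + AI_API_COST)

for calls in (10_000, 50_000, 200_000):
    print(f"{calls:>7,} calls/mo: SIP+platform ₹{monthly_cost_sip(calls):>10,.0f}"
          f" | WebSocket ₹{monthly_cost_ws(calls):>10,.0f}")
```

Under these assumptions the WebSocket path is cheaper per call at every volume; the point of the 50,000/month threshold is that below it, the absolute difference is small enough that the engineering time a managed platform saves usually dominates.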

Latency Analysis

SIP has lower audio transport latency than WebSocket streaming. SIP uses UDP/RTP — a fire-and-forget protocol that never retransmits dropped packets, keeping audio delivery strictly real-time. WebSocket runs over TCP, which guarantees delivery by retransmitting lost packets — useful for data, but a source of jitter for live audio on poor networks.

If latency is the only factor you care about, SIP wins. But latency is rarely why developers choose WebSocket streaming. They choose it for the ecosystem: direct access to AI frameworks (Pipecat), raw STT/LLM/TTS APIs, full pipeline control, and lower cost. Those benefits come with a small latency trade-off that is imperceptible in practice on stable connections.

Transport Latency — SIP vs WebSocket

Vobiz telephony layer: audio delivery from PSTN to your server is under 50ms on both paths.

SIP Trunking (UDP/RTP), the lower-latency path:
- Vobiz base delivery: <50ms
- Audio transport (UDP/RTP): ~20ms frames
- Packet loss handling: packets skipped, no added delay
- Audio jitter on poor networks: minimal

UDP drops lost packets and moves on, so audio stays in sync even when the network degrades. This is why PSTN and broadcast media use RTP.

WebSocket Streaming (TCP), the richer-ecosystem path:
- Vobiz base delivery: <50ms
- Audio transport (TCP/WebSocket): ~20ms frames
- Packet loss handling: packets retransmitted, adding delay
- Audio jitter on poor networks: can spike

TCP retransmits lost packets, which can cause bursts of delayed audio on congested networks. On stable connections the difference is negligible.

AI Pipeline Latency — Identical on Both Transports

Regardless of transport, the dominant latency is always the AI pipeline. Vobiz delivers audio in under 50ms on both paths. Everything after that is STT + LLM + TTS — and that is where 95%+ of the perceived wait comes from.

| Stage | Latency | Notes |
| --- | --- | --- |
| Vobiz audio delivery | <50ms | Platform contribution, same on SIP and WebSocket |
| VAD: end-of-turn detection | 100–400ms | Silence threshold + VAD model decision |
| STT: speech → transcript | 150–300ms | Deepgram Nova-2 streaming, first result |
| LLM: transcript → first token | 150–400ms | GPT-4o-mini ~150ms · GPT-4o ~250ms · Claude ~200ms |
| TTS: text → first audio chunk | 100–300ms | Cartesia ~100ms · ElevenLabs ~200ms · OpenAI TTS ~200ms |
| Total: caller stops → first AI word | 500ms–1.4s | |

SIP and WebSocket both contribute <50ms of this total. Optimise your AI pipeline, not your transport.
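Summing the mid-range values from the table above makes the point concrete — the transport is a rounding error next to the AI stages:

```python
# Mid-range latency estimates (ms) from the table above -- swap in
# measured values from your own pipeline.
TRANSPORT_MS = 50  # Vobiz delivery, identical on SIP and WebSocket
PIPELINE_MS = {
    "VAD end-of-turn": 250,
    "STT first result": 225,
    "LLM first token": 275,
    "TTS first chunk": 200,
}

pipeline_total = sum(PIPELINE_MS.values())
total = TRANSPORT_MS + pipeline_total
print(f"Caller stops -> first AI word: ~{total} ms")
print(f"AI pipeline share: {100 * pipeline_total / total:.0f}%")
```

With these mid-range numbers the round trip lands at roughly one second, with the AI pipeline responsible for 95% of it.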

Scaling Comparison

SIP Trunking: Carrier-Layer Scaling

Concurrent call capacity scales at the carrier level. Adding 100 more simultaneous calls means increasing your Vobiz trunk channel count — a configuration change. Your server infrastructure (the AI platform like LiveKit) scales independently using its own horizontal scaling mechanisms.

- Trunk capacity: instant provisioning, no deployment
- Platform scaling: managed by LiveKit/VAPI/Retell
- Your agent code scales independently of call volume
- Platform costs scale linearly with call volume
- Platform SLA limits may apply at high concurrency

WebSocket Streaming: Process-Per-Call Scaling

Every active call is a persistent WebSocket connection consuming CPU (audio conversion, AI processing) and memory (call state, audio buffers). You must provision server capacity proportional to peak concurrent calls.

- No per-call platform fee: cost stays flat per call
- Full control over scaling architecture (containers, serverless, workers)
- One process per concurrent call (a Pipecat constraint)
- Server infrastructure provisioning is your responsibility
- Routing calls to available worker processes requires orchestration
Practical scaling targets: With Pipecat on a 2-vCPU / 4GB RAM container, expect to handle 1–3 concurrent calls per container depending on AI model latency and audio processing overhead. For 100 concurrent calls, plan for 40–100 containers behind a load balancer that routes new WebSocket connections to available workers.
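The capacity planning above reduces to a simple ceiling division. A minimal sketch, where the default of 2 calls per container is an assumption in the middle of the 1–3 range quoted above and should be replaced with your own benchmark:

```python
import math

def containers_needed(peak_concurrent_calls: int,
                      calls_per_container: int = 2) -> int:
    """Size a worker pool for process-per-call WebSocket scaling.

    calls_per_container depends on AI model latency and audio
    processing overhead -- benchmark your own pipeline; the 1-3
    range for a 2-vCPU / 4GB container is only a rough guide.
    """
    return math.ceil(peak_concurrent_calls / calls_per_container)

print(containers_needed(100))                         # mid estimate: 50
print(containers_needed(100, calls_per_container=1))  # worst case: 100
```

The load balancer in front of this pool must route each new WebSocket connection to a worker with spare capacity, which is the orchestration burden noted in the list above.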

When to Choose SIP Trunking

You are integrating with LiveKit, VAPI, or Retell AI

All three platforms are SIP-native. Point your Vobiz trunk at their SIP endpoint and you are live in hours. Their tooling, docs, and support are built around SIP.

You are running under 50,000 calls per month

At this volume, the engineering time saved by using a managed SIP platform outweighs the per-call platform cost. Build fast, ship fast, optimise later.

You are connecting to enterprise PBX infrastructure

Avaya, Cisco UCM, Microsoft Teams Direct Routing — all speak SIP natively. No practical alternative for enterprise telephony integration.

You want the fastest time to production

Managed SIP platforms (VAPI, Retell, LiveKit) handle STT, LLM, TTS, and call infrastructure for you. A working AI agent can be live in a day without building a pipeline.

Regulated industry (PCI-DSS, HIPAA)

SIP with SRTP + TLS is the established standard for compliant voice deployments. Enterprise audit trails and security certifications are better supported.

When to Choose WebSocket Streaming

You are running over 50,000 calls per month

At this volume, the per-call platform markup on a managed SIP platform adds up fast. Owning the WebSocket pipeline and paying Vobiz + raw AI APIs directly saves significantly at scale.

You are building a custom AI pipeline (Pipecat, bare-metal Python)

Pipecat is designed for WebSocket transport. If you are wiring up your own STT + LLM + TTS stack, WebSocket streaming gives you raw audio direct to your server — no platform constraints.

You want full control over the audio pipeline

Custom VAD logic, barge-in handling, proprietary STT models, multi-step routing, audio injection — if you want to own every layer, WebSocket streaming is the only path.

Your team is Python or Node.js native

WebSocket streaming maps to skills your team already has. The challenge is audio encoding (µ-law), not telephony protocols. No SIP expertise required.
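That audio-encoding challenge is smaller than it sounds: G.711 µ-law expansion is a few lines of arithmetic. A pure-Python per-sample sketch is below; note the stdlib `audioop` module used to do this but was removed in Python 3.13, so production code typically uses a library (e.g. numpy-based conversion) rather than a per-sample loop:

```python
def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample.

    Standard G.711 expansion with bias 0x84. Per-sample Python is
    fine for illustration; batch-convert with a library in production.
    """
    byte = ~byte & 0xFF              # mu-law bytes are stored inverted
    sign = byte & 0x80               # top bit: sample polarity
    exponent = (byte >> 4) & 0x07    # 3-bit segment number
    mantissa = byte & 0x0F           # 4-bit position within segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

# 0xFF encodes silence (0); 0x00 is the largest negative sample
print(ulaw_to_pcm16(0xFF), ulaw_to_pcm16(0x00))  # 0 -32124
```

Decoding an inbound Vobiz frame is then just mapping this over the raw bytes; encoding the return audio is the inverse compression step.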

Rapid prototyping without platform dependency

A single Python file with FastAPI + ngrok is a fully working voice bot. No managed platform account, no SIP trunk config, no IP ACLs. Fastest zero-to-demo path.

No call transfer or enterprise PBX required

If your use case is purely AI-answered inbound calls with no human handoff and no PBX routing, WebSocket streaming is simpler, cheaper, and gives you more control.


Decision Flowchart

Answer each question in order. Stop at the first definitive answer.

1. Do you need to transfer live calls to a human agent?
   (E.g. escalating from an AI agent to a live support rep mid-call.)
   • Yes → Both paths support this. Vobiz supports live call transfer on both SIP and WebSocket; continue to question 2 to choose based on other factors.
   • No → Continue to question 2.

2. Are you integrating with LiveKit, VAPI, or Retell AI?
   (Managed platforms that handle STT, LLM, TTS, and call infrastructure for you.)
   • Yes → Use SIP Trunking. These platforms are SIP-native; point your Vobiz trunk at their SIP endpoint and you are live.
   • No → Continue to question 3.

3. How many calls do you expect per month?
   (The key cost inflection point between managed platforms and owning the pipeline.)
   • Under 50,000/mo → Use SIP Trunking. Platform overhead is manageable, and you save weeks of engineering time vs. building your own pipeline.
   • Over 50,000/mo → Consider WebSocket. Per-call platform markup compounds; owning your own WebSocket pipeline pays off at this scale.

4. Are you building a custom AI pipeline, or do you want full control?
   (E.g. Pipecat, bare-metal Python, or owning every layer of audio processing.)
   • Yes → Use WebSocket Streaming. Raw audio direct to your server, fast to build, with no managed platform standing between you and the call data.
   • No → Use SIP Trunking. Let a managed platform (LiveKit, VAPI, Retell) handle the complexity for the fastest time to production.


Migration Path

A common pattern for teams building production voice AI: start with WebSocket streaming (fast to prototype, cheap to run, minimal infrastructure) and migrate to SIP when the product needs call transfer, enterprise PBX integration, or when the team is ready to adopt a managed platform like LiveKit.

Phase 1: WebSocket Streaming (weeks 1–8)
- Pipecat + Vobiz webhook
- Deepgram + OpenAI + ElevenLabs
- Validate the product, control costs

Phase 2: Add SIP Layer (when the product is proven)
- Vobiz SIP trunk → LiveKit
- Enable call transfer
- Connect to enterprise PBX

Phase 3: Hybrid (at scale)
- SIP for routing + call control
- WebSocket for audio processing
- Both in the same architecture

Note that your Vobiz DID number and channel setup stay the same across all phases — only the downstream routing configuration changes. There is no re-provisioning or number porting required when adding SIP routing to an existing WebSocket-based deployment.

Vobiz Recommendation

Choose SIP Trunking when:
  • Integrating with LiveKit, VAPI, or Retell AI
  • Call transfer to humans is required
  • Enterprise PBX or call center integration
  • Regulated industry compliance needed
SIP Trunking deep dive →
Choose WebSocket Streaming when:
  • Building custom pipeline (Pipecat, Python)
  • Optimizing for cost at scale
  • Rapid prototyping with familiar stack
  • No transfer or PBX requirements
WebSocket Streaming deep dive →