Sayna sits between your apps, telecom carriers, and third-party speech providers. Use this page to understand the big-picture data flows before diving into specific APIs.
Need request/response schemas? Pair this overview with the REST API, WebSocket, and SIP configuration references.

System landscape

Traffic entry points

WebSocket (/ws) – The main control plane. Clients connect once, send a config payload (providers + LiveKit settings), and then stream audio, emit speak commands, or relay LiveKit data with sub-second latency.
REST endpoints – Provide complementary one-off operations:
  • /voices lists provider catalogs for UI pickers.
  • /speak performs one-shot synthesis when a persistent socket is overkill.
  • /livekit/token mints participant tokens so browser/mobile clients can join the same LiveKit room that Sayna already inhabits.
SIP ingress – When the sip block is populated, Sayna exposes a SIP domain/IP. Carriers such as Twilio point their Origination URI at that address (e.g., sip:sip.sayna.ai;transport=tcp). Sayna enforces room_prefix constraints, filters source IPs, and forwards LiveKit webhook payloads to the hosts defined in SIP_HOOKS_JSON.
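
The two SIP ingress checks described above (room_prefix enforcement and source-IP filtering) can be sketched roughly as follows. This is an illustration, not Sayna's implementation: the comma-separated IP/CIDR format for the allow-list and all function names are assumptions made for the example.

```python
import ipaddress


def parse_allowed(raw: str) -> list:
    """Parse an allow-list string into network objects.

    Assumption: the value is a comma-separated list of IPs or CIDR blocks.
    The real format of Sayna's allow-list setting may differ.
    """
    return [
        ipaddress.ip_network(item.strip(), strict=False)
        for item in raw.split(",")
        if item.strip()
    ]


def is_allowed_source(source_ip: str, allowed: list) -> bool:
    """Return True when the caller's IP falls inside any allowed network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in allowed)


def room_matches_prefix(room_name: str, room_prefix: str) -> bool:
    """Return True when the room name starts with the configured prefix."""
    return bool(room_prefix) and room_name.startswith(room_prefix)


# Hypothetical values using documentation-only address ranges.
allowed = parse_allowed("198.51.100.0/24, 203.0.113.7")
```

A call would be accepted only when both checks pass, for example `is_allowed_source(ip, allowed) and room_matches_prefix(room, "sip-")`.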

Voice orchestration pipeline

  1. Configuration – Sayna validates the config message, injects server-side API keys, initializes STT/TTS providers, loads DSP assets (turn detection, noise filtering), and emits ready once everything is live.
  2. Streaming – Binary frames flow through the STT connector; stt_result events stream back with is_final and is_speech_final hints so you know when to respond.
  3. Synthesis – speak commands enqueue TTS jobs. Audio streams back as binary frames and, when LiveKit is enabled, is mirrored into the room for human listeners.
  4. Caching – TTS outputs are hashed by text + config and stored under CACHE_PATH, so repeated prompts replay instantly.
  5. Error handling – Most faults surface as JSON error events while keeping the socket open, giving clients room to retry.
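
To make the caching step concrete, here is a minimal sketch of deriving a cache key from text + config. The source only states that outputs are hashed by text and config; the use of SHA-256, the canonical JSON encoding, the `.bin` extension, and the function names are all assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path


def tts_cache_key(text: str, tts_config: dict) -> str:
    """Derive a stable key from the prompt text and the TTS config.

    Assumption: SHA-256 over canonical JSON. Sorting keys makes the key
    independent of the order in which config fields were supplied.
    """
    canonical = json.dumps(
        {"text": text, "config": tts_config},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_file(cache_path: str, text: str, tts_config: dict) -> Path:
    """Map a prompt to a file under the cache directory (hypothetical layout)."""
    return Path(cache_path) / f"{tts_cache_key(text, tts_config)}.bin"
```

Under this scheme any change to the text or to a config field yields a different key, which is why consistent tts_config values matter for cache hits.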

LiveKit & SIP interplay

  • Sayna runs its own LiveKit participant (identity defaults to sayna-ai). Other clients call /livekit/token to get their own access tokens; Sayna never shares its agent keys.
  • When enable_recording=true, Sayna asks LiveKit to start composite recordings to the S3 target you configured.
  • SIP mode auto-provisions the LiveKit SIP trunk + dispatch rule (sayna-{room_prefix}-trunk/dispatch). You only configure the carrier; Sayna takes care of LiveKit.
  • The SIP dispatcher reads sip.h.to headers from LiveKit webhook events and forwards the raw JSON payload to whichever hook hostname matches, which suits per-domain routing or downstream analytics.
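
The hook-matching idea in the last bullet can be sketched as below. The shape of SIP_HOOKS_JSON is not specified on this page, so the example assumes a JSON list of objects with hypothetical `host` and `url` fields; the header-parsing regex and function names are likewise illustrative and do not handle bracketed IPv6 hosts.

```python
import json
import re
from typing import Optional

# Matches "sip:" or "sips:", an optional "user@" part, then captures the host.
_SIP_HOST = re.compile(r"sips?:(?:[^@>\s]*@)?([^;>:\s]+)", re.IGNORECASE)


def host_from_to_header(value: str) -> Optional[str]:
    """Extract the lowercased host from a SIP To header value, if present."""
    match = _SIP_HOST.search(value)
    return match.group(1).lower() if match else None


def load_hooks(raw_json: str) -> dict:
    """Build a host -> URL map. Assumed shape: [{"host": ..., "url": ...}]."""
    return {entry["host"].lower(): entry["url"] for entry in json.loads(raw_json)}


def pick_hook(to_header: str, hooks: dict) -> Optional[str]:
    """Return the hook URL whose host matches the To header, or None."""
    host = host_from_to_header(to_header)
    return hooks.get(host) if host else None


# Hypothetical configuration for illustration only.
hooks = load_hooks(
    '[{"host": "support.example.com", "url": "https://hooks.example.com/support"}]'
)
```

A dispatcher built this way would forward the untouched webhook body to the URL returned by `pick_hook` and skip forwarding when it returns None.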

Component responsibilities

Component | Purpose | Key considerations
Edge & optional auth | Terminates TLS/WebSockets and enforces API secret or delegated JWT policies before traffic reaches /ws or REST handlers. | Pair with AUTH_REQUIRED=true when exposing Sayna on the public internet.
API & WebSocket gateway | Hosts /ws, /voices, /speak, /livekit/token, validates payloads, and multiplexes JSON/binary frames. | Idle sockets close after ~10 s; errors rarely tear down the connection.
Voice orchestrator | Manages per-session state (providers, caches, DSP), schedules STT/TTS work, and prevents mismatched streams. | Configuration drives everything: use consistent stt_config and tts_config when you need cache hits.
Provider connectors | Abstract Deepgram STT/TTS and ElevenLabs TTS behind a unified schema. | Outbound HTTPS is required; provide the relevant API keys via env vars.
LiveKit transport | Joins rooms, mirrors Sayna’s audio into WebRTC, relays data-channel messages, and coordinates recordings. | /livekit/token keeps your own agentic logic and participants in sync with Sayna’s LiveKit room.
SIP hook router | Enforces room_prefix, respects SIP_ALLOWED_ADDRESSES, and forwards webhook payloads to domain-specific HTTPS endpoints. | Combine with Twilio SIP setup when routing PSTN callers into Sayna.

Common deployment patterns

Pattern | Flow
WebSocket-only assistant | App connects to /ws → sends config (no LiveKit) → streams mic audio → reacts to stt_result → sends speak for replies.
Hybrid WebSocket + LiveKit | Sayna joins LiveKit via config → browsers fetch /livekit/token and join the room → LiveKit audio feeds Sayna’s STT, speak audio mirrors back to participants, data-channel messages sync UI state.
PSTN ingress via Twilio | Twilio trunk dials sip.yourdomain.com;transport=tcp → Sayna validates room_prefix and source IPs → auto-provisioned LiveKit trunk/dispatch routes the call → /livekit/token lets agents join from browsers while SIP hooks notify backend systems.
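
For the WebSocket-only pattern, the "react to stt_result, then send speak" step might look like the transport-agnostic sketch below. The event names (ready, stt_result, error, speak) and the is_final / is_speech_final flags come from this page; the JSON field names `type`, `transcript`, `text`, and `message`, and the `TurnTracker` class itself, are assumptions, so check the WebSocket reference for the real schema.

```python
import json
from typing import Callable, List, Optional


class TurnTracker:
    """Collects finalized transcript segments and replies at end of speech."""

    def __init__(self, reply_fn: Callable[[str], str]):
        self.reply_fn = reply_fn          # maps user utterance -> reply text
        self.ready = False                # set once the server emits ready
        self.finals: List[str] = []       # finalized segments of this turn
        self.last_error: Optional[str] = None

    def handle(self, raw: str) -> Optional[str]:
        """Process one JSON text frame; return a speak command or None."""
        event = json.loads(raw)
        kind = event.get("type")          # assumed discriminator field
        if kind == "ready":
            self.ready = True
        elif kind == "stt_result":
            if event.get("is_final"):
                self.finals.append(event.get("transcript", ""))
            if event.get("is_speech_final"):
                utterance = " ".join(p for p in self.finals if p).strip()
                self.finals.clear()
                if utterance:
                    reply = self.reply_fn(utterance)
                    return json.dumps({"type": "speak", "text": reply})
        elif kind == "error":
            # The socket normally stays open, so just record and carry on.
            self.last_error = event.get("message")
        return None
```

The caller would feed each incoming text frame to `handle` and write any returned string back to the socket; binary audio frames travel separately and are not modeled here.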

Operational checklist

  • Dependencies – Outbound HTTPS to providers + LiveKit; inbound TCP from carrier IPs when SIP is enabled.
  • Scaling – /ws sessions are stateful. Run multiple Sayna instances behind a load balancer (with sticky sessions if the proxy terminates WebSockets).
  • Observability – Monitor ready vs error rates, STT latency, cache hit ratios, SIP provisioning logs, and webhook forwarding success.
  • Security – Use AUTH_REQUIRED=true plus API secrets or delegated JWT to protect REST + WebSocket traffic, and keep SIP hooks HTTPS-only.
With this architecture in mind you can choose the right combination of WebSocket, REST, LiveKit, and SIP capabilities for your deployment before implementing the finer details.