Need request/response schemas? Pair this overview with the REST API, WebSocket, and SIP configuration references.
System landscape
Traffic entry points
WebSocket (/ws) – The main control plane. Clients connect once, send a config payload (providers + LiveKit settings), and then stream audio, emit speak commands, or relay LiveKit data with sub-second latency.
REST endpoints – Provide complementary one-off operations:
- /voices lists provider catalogs for UI pickers.
- /speak performs one-shot synthesis when a persistent socket is overkill.
- /livekit/token mints participant tokens so browser/mobile clients can join the same LiveKit room that Sayna already inhabits.
SIP – When the sip block is populated, Sayna exposes a SIP domain/IP. Carriers such as Twilio point their Origination URI at that address (e.g., sip:sip.sayna.ai;transport=tcp). Sayna enforces room_prefix constraints, filters source IPs, and forwards LiveKit webhook payloads to the hosts defined in SIP_HOOKS_JSON.
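As a rough illustration of the WebSocket entry point, the sketch below assembles a config message as JSON. Only the message name config and the stt_config / tts_config keys appear in this overview; the type discriminator, the livekit key, and the provider values are assumptions made for illustration, so check the WebSocket reference for the actual schema.

```python
import json
from typing import Any, Optional


def build_config_message(
    stt_config: dict[str, Any],
    tts_config: dict[str, Any],
    livekit: Optional[dict[str, Any]] = None,
) -> str:
    """Serialize a hypothetical config payload for the /ws control plane."""
    message: dict[str, Any] = {
        "type": "config",  # assumed discriminator field
        "stt_config": stt_config,
        "tts_config": tts_config,
    }
    if livekit is not None:
        message["livekit"] = livekit  # assumed key for LiveKit settings
    return json.dumps(message)


# Illustrative values only; real provider options live in the references.
payload = build_config_message(
    stt_config={"provider": "deepgram"},
    tts_config={"provider": "elevenlabs"},
)
```

A client would send this string as the first text frame after connecting, then wait for ready before streaming audio.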
Voice orchestration pipeline
- Configuration – Sayna validates the config message, injects server-side API keys, initializes STT/TTS providers, loads DSP assets (turn detection, noise filtering), and emits ready once everything is live.
- Streaming – Binary frames flow through the STT connector; stt_result events stream back with is_final and is_speech_final hints so you know when to respond.
- Synthesis – speak commands enqueue TTS jobs. Audio streams back as binary frames and, when LiveKit is enabled, mirrors into the room for human listeners.
- Caching – TTS outputs are hashed by text + config and stored under CACHE_PATH, so repeated prompts replay instantly.
- Error handling – Most faults surface as JSON error events while keeping the socket open, giving clients room to retry.
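The caching step can be pictured with a small sketch that derives a cache path from the text plus the TTS configuration. This overview does not say which hash or serialization Sayna uses, so SHA-256, the JSON canonicalization, the .bin extension, and the function name tts_cache_path are all assumptions here.

```python
import hashlib
import json
from pathlib import Path
from typing import Any


def tts_cache_path(cache_root: str, text: str, tts_config: dict[str, Any]) -> Path:
    """Derive a deterministic cache file path from the text and TTS config."""
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps({"text": text, "config": tts_config}, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return Path(cache_root) / f"{digest}.bin"  # file extension is an assumption


# Same text and config (in any key order) map to the same file.
a = tts_cache_path("/tmp/cache", "Hello", {"voice": "x", "speed": 1.0})
b = tts_cache_path("/tmp/cache", "Hello", {"speed": 1.0, "voice": "x"})
# Changing any config value yields a different file.
c = tts_cache_path("/tmp/cache", "Hello", {"voice": "y", "speed": 1.0})
```

Whatever the real scheme is, the practical consequence is the same: a prompt only replays from cache when both the text and the TTS configuration match exactly.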
LiveKit & SIP interplay
- Sayna runs its own LiveKit participant (identity defaults to sayna-ai). Other clients call /livekit/token to get their own access tokens; Sayna never shares its agent keys.
- When enable_recording=true, Sayna asks LiveKit to start composite recordings to the S3 target you configured.
- SIP mode auto-provisions the LiveKit SIP trunk + dispatch rule (sayna-{room_prefix}-trunk/dispatch). You only configure the carrier; Sayna takes care of LiveKit.
- The SIP dispatcher reads sip.h.to headers from LiveKit webhook events and forwards the raw JSON payload to whichever hook hostname matches, which suits per-domain routing or downstream analytics.
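The hook-matching step in the last bullet might look roughly like the sketch below: pull the domain out of the To header and look it up in the configured hooks. The shape of SIP_HOOKS_JSON (a list of host/url objects), the example hostnames, and both function names are assumptions; the regular expression is a simplification that does not handle IPv6 literals.

```python
import json
import re
from typing import Optional

# Assumed shape for SIP_HOOKS_JSON: a list of {"host": ..., "url": ...} objects.
SIP_HOOKS_JSON = '[{"host": "acme.example.com", "url": "https://hooks.acme.example.com/sip"}]'

# Matches sip: or sips:, skips an optional user part, captures the host.
_SIP_HOST = re.compile(r"sips?:(?:[^@>;]*@)?([^:;>\s]+)", re.IGNORECASE)


def host_from_to_header(to_header: str) -> Optional[str]:
    """Extract the lowercase domain from a SIP To header value."""
    match = _SIP_HOST.search(to_header)
    return match.group(1).lower() if match else None


def select_hook(to_header: str, hooks_json: str = SIP_HOOKS_JSON) -> Optional[str]:
    """Return the hook URL whose host matches the To header, if any."""
    host = host_from_to_header(to_header)
    for hook in json.loads(hooks_json):
        if hook["host"].lower() == host:
            return hook["url"]
    return None
```

Calling select_hook("sip:100@acme.example.com") returns the configured URL, while a To header for an unlisted domain returns None, in which case nothing would be forwarded.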
Component responsibilities
| Component | Purpose | Key considerations |
|---|---|---|
| Edge & optional auth | Terminates TLS/WebSockets and enforces API secret or delegated JWT policies before traffic reaches /ws or REST handlers. | Pair with AUTH_REQUIRED=true when exposing Sayna on the public internet. |
| API & WebSocket gateway | Hosts /ws, /voices, /speak, /livekit/token, validates payloads, and multiplexes JSON/binary frames. | Idle sockets close after ~10 s; errors rarely tear down the connection. |
| Voice orchestrator | Manages per-session state (providers, caches, DSP), schedules STT/TTS work, and prevents mismatched streams. | Configuration drives everything—use consistent stt_config and tts_config when you need cache hits. |
| Provider connectors | Abstract Deepgram STT/TTS and ElevenLabs TTS behind a unified schema. | Outbound HTTPS is required; provide the relevant API keys via env vars. |
| LiveKit transport | Joins rooms, mirrors Sayna’s audio into WebRTC, relays data-channel messages, and coordinates recordings. | /livekit/token keeps your own agentic logic and participants in sync with Sayna’s LiveKit room. |
| SIP hook router | Enforces room_prefix, respects SIP_ALLOWED_ADDRESSES, and forwards webhook payloads to domain-specific HTTPS endpoints. | Combine with Twilio SIP setup when routing PSTN callers into Sayna. |
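To make the SIP hook router's admission checks concrete, here is a minimal sketch that combines a room_prefix test with a source-IP allowlist. It assumes SIP_ALLOWED_ADDRESSES holds plain IPs or CIDR ranges and that the prefix rule is a simple starts-with comparison; neither detail is confirmed by this overview, and the function names are hypothetical.

```python
import ipaddress


def is_allowed_source(source_ip: str, allowed: list[str]) -> bool:
    """Check a caller's IP against allowlisted addresses or CIDR ranges."""
    address = ipaddress.ip_address(source_ip)
    # strict=False accepts entries whose host bits are set, e.g. "203.0.113.5/24".
    return any(address in ipaddress.ip_network(entry, strict=False) for entry in allowed)


def admit_call(room_name: str, source_ip: str, room_prefix: str, allowed: list[str]) -> bool:
    """Admit a call only if both the room prefix and source IP checks pass."""
    return room_name.startswith(room_prefix) and is_allowed_source(source_ip, allowed)


# Documentation-range addresses used purely as placeholders.
allowed = ["203.0.113.0/24", "198.51.100.7"]
accepted = admit_call("sip-123", "203.0.113.42", "sip-", allowed)
rejected = admit_call("sip-123", "192.0.2.1", "sip-", allowed)
```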
Common deployment patterns
| Pattern | Flow |
|---|---|
| WebSocket-only assistant | App connects to /ws → sends config (no LiveKit) → streams mic audio → reacts to stt_result → sends speak for replies. |
| Hybrid WebSocket + LiveKit | Sayna joins LiveKit via config → browsers fetch /livekit/token and join the room → LiveKit audio feeds Sayna’s STT, speak audio mirrors back to participants, data-channel messages sync UI state. |
| PSTN ingress via Twilio | Twilio trunk dials sip.yourdomain.com;transport=tcp → Sayna validates room_prefix and source IPs → auto-provisioned LiveKit trunk/dispatch routes the call → /livekit/token lets agents join from browsers while SIP hooks notify backend systems. |
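For the WebSocket-only pattern, the "reacts to stt_result, then sends speak" step could be sketched as a pure function that maps an incoming event to an outgoing command. The is_final and is_speech_final hints come from this overview; the type, transcript, and text field names are assumptions, as is the echo-style reply.

```python
import json
from typing import Optional


def next_action(raw_message: str) -> Optional[dict]:
    """Map an incoming JSON event to a speak command, or None to keep listening."""
    event = json.loads(raw_message)
    kind = event.get("type")  # assumed discriminator field
    # Reply only once the speaker has finished the utterance.
    if kind == "stt_result" and event.get("is_speech_final"):
        transcript = event.get("transcript", "")  # assumed field name
        return {"type": "speak", "text": f"You said: {transcript}"}
    # ready, error, and interim results need no reply here; the socket
    # stays open after most errors, so the caller may simply retry.
    return None
```

Keeping the decision logic free of I/O like this makes it easy to unit test before wiring it to a real socket loop.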
Operational checklist
- Dependencies – Outbound HTTPS to providers + LiveKit; inbound TCP from carrier IPs when SIP is enabled.
- Scaling – /ws sessions are stateful. Run multiple Sayna instances behind a load balancer (with sticky sessions if the proxy terminates WebSockets).
- Observability – Monitor ready vs error rates, STT latency, cache hit ratios, SIP provisioning logs, and webhook forwarding success.
- Security – Use AUTH_REQUIRED=true plus API secrets or delegated JWT to protect REST + WebSocket traffic, and keep SIP hooks HTTPS-only.
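Because /ws sessions are stateful, each session has to keep landing on the same instance. Most load balancers offer this as a built-in sticky-session setting; the sketch below only illustrates the idea using rendezvous (highest-random-weight) hashing. The instance names and the notion of a session ID are hypothetical and not part of Sayna.

```python
import hashlib


def pick_instance(session_id: str, instances: list[str]) -> str:
    """Pin a session to one backend using rendezvous hashing."""
    def weight(instance: str) -> int:
        digest = hashlib.sha256(f"{instance}|{session_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # The instance with the highest weight wins, regardless of list order.
    return max(instances, key=weight)


backends = ["sayna-a", "sayna-b", "sayna-c"]  # hypothetical instance names
chosen = pick_instance("session-42", backends)
```

With this scheme, removing an instance only remaps the sessions that were pinned to it; every other session keeps its original backend.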