The /ws endpoint is the control plane for every real-time experience. A single socket handles streaming audio, transcription events, TTS playback, and LiveKit data. This page distills the full protocol from the engineering brief (tmp/docs/websocket.mdoc) into actionable reference material.
End-to-end flow
Handshake
Open a WS connection (ws://host:3001/ws or wss://). Ping/pong is automatic; idle sockets close after ~10 s of silence.
Config
The very first client message must be type: "config". Sayna injects credentials, boots STT/TTS providers, provisions LiveKit (if configured), warms caches, and responds with ready once every subsystem is up (30 s timeout).
Active session
Stream binary PCM/Opus frames, queue speak jobs, drive the loading indicator with loading_start / loading_stop, forward LiveKit data via send_message, and react to stt_result, message, participant_disconnected, and binary TTS audio frames.
Message taxonomy
| Direction | Type | Purpose |
|---|---|---|
| Client → Server | config | Declare STT/TTS/livekit setup; must come first. |
| Client → Server | Binary audio | Raw PCM/Opus chunks that match stt_config. |
| Client → Server | speak | Queue TTS synthesis (optionally override defaults per call). |
| Client → Server | clear | Flush TTS buffers (WebSocket + LiveKit) unless playback is locked. |
| Client → Server | loading_start | Begin looping the configured loading_audio clip on a dedicated loading-audio LiveKit track. |
| Client → Server | loading_stop | Stop the loading loop with a short fade-out. Idempotent and silent server-side. |
| Client → Server | send_message | Publish JSON payloads over the LiveKit data channel. |
| Client → Server | sip_transfer | Transfer the current SIP call to another phone number. |
| Client → Server | update_config | Dynamically adjust VAD and turn detection parameters. |
| Server → Client | ready | Providers + LiveKit are initialized; safe to stream. |
| Server → Client | stt_result | Interim/final transcripts with is_final / is_speech_final. |
| Server → Client | Binary audio | TTS playback chunks in the negotiated format. |
| Server → Client | message | Informational notices (LiveKit state, provider hints, etc.). |
| Server → Client | tts_playback_complete | Signals a speak request finished emitting audio. |
| Server → Client | participant_connected | LiveKit peer joined the room. |
| Server → Client | participant_disconnected | LiveKit peer left → Sayna closes the socket after 100 ms. |
| Server → Client | track_subscribed | Sayna subscribed to a participant’s media track. |
| Server → Client | vad_event | VAD activity change: speech start, silence, resume, or turn end. |
| Server → Client | error | Validation/provider issues; connection usually stays open so you can retry. |
| Server → Client | sip_transfer_error | SIP transfer operation failed. |
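The taxonomy above can be captured client-side as two sets of JSON message types plus a small guard for the "config must come first" rule. This is a minimal sketch: the type names are copied from the table, while the set names and the check_outgoing helper are illustrative and not part of Sayna.

```python
# JSON message types copied from the taxonomy table (binary audio frames are not JSON).
CLIENT_TYPES = frozenset({
    "config", "speak", "clear", "loading_start", "loading_stop",
    "send_message", "sip_transfer", "update_config",
})
SERVER_TYPES = frozenset({
    "ready", "stt_result", "message", "tts_playback_complete",
    "participant_connected", "participant_disconnected",
    "track_subscribed", "vad_event", "error", "sip_transfer_error",
})


def check_outgoing(message: dict, config_sent: bool) -> None:
    """Raise ValueError if a client JSON message breaks the ordering rules."""
    msg_type = message.get("type")
    if msg_type not in CLIENT_TYPES:
        raise ValueError(f"unknown client message type: {msg_type!r}")
    if not config_sent and msg_type != "config":
        raise ValueError("the first client message must be type 'config'")
```

Calling the guard before every send catches ordering mistakes locally instead of waiting for a server-side error.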
Configuration message
- audio=false skips STT/TTS initialization but still allows LiveKit control + data channel relays.
- stream_id is optional; provide one to control recording paths and correlate sessions, or let Sayna generate a UUID v4 and return it in the ready message.
- stt_config.sample_rate / channels / encoding must exactly match the binary frames you send; Sayna will not resample.
- tts_config values become session defaults; individual speak messages can override most fields or add id for correlation.
- stt_config.auth and tts_config.auth are optional objects that supply provider credentials for this session only. When provided, they fully replace server-configured credentials (no partial merge). Omit or send {} to use server defaults. See Provider auth overrides below for supported shapes per provider.
- The livekit block is optional but required for send_message and mirrored playback/recording.
- loading_audio is optional and additive; it is only processed when audio=true and the livekit block is present, and a decode failure is non-fatal (the session continues without loading capability). See the dedicated Loading indicator section below.
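The rules above can be sketched as a small builder for the first client message. The top-level keys (type, audio, stream_id, stt_config, tts_config, livekit, loading_audio) come from this page; the provider identifiers and the remaining values are placeholder assumptions, and the exact schema of the livekit block is not shown here.

```python
import json


def build_config(stream_id=None, livekit=None, loading_audio=None):
    """Assemble the first client message; values are illustrative placeholders."""
    config = {
        "type": "config",
        "audio": True,
        "stt_config": {
            "provider": "deepgram",    # assumed provider identifier
            "sample_rate": 16000,      # must match the binary frames exactly
            "channels": 1,
            "encoding": "linear16",
        },
        "tts_config": {
            "provider": "elevenlabs",  # assumed provider identifier
            "auth": {},                # {} or omitted: server-configured credentials
        },
    }
    if stream_id is not None:
        config["stream_id"] = stream_id      # otherwise Sayna generates a UUID v4
    if livekit is not None:
        config["livekit"] = livekit
        if loading_audio is not None:
            # Only attached with audio=true and a livekit block, mirroring the rule above.
            config["loading_audio"] = loading_audio
    return config


payload = json.dumps(build_config(stream_id="call-42"))
```

The resulting payload string would be sent as the first text frame on the socket, before any audio.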
Streaming audio best practices
- Chunk size: ~100–200 ms of audio per binary frame (e.g., 3,200 bytes is 100 ms of 16 kHz mono linear16).
- Maintain consistent cadence; bursty uploads increase STT latency.
- Enable punctuation + turn-detect (default features) when you need is_speech_final cues to know when a speaker stopped.
- If the client detects a barge-in (human interrupts AI), send {"type":"clear"} before queueing new speak text.
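The chunk-size guidance above is simple arithmetic: bytes per frame = sample rate × channels × bytes per sample × duration. A minimal sketch for uncompressed linear16 audio follows; the helper names are illustrative, and Opus frames would need a different calculation.

```python
def frame_size_bytes(sample_rate, channels=1, frame_ms=100, bytes_per_sample=2):
    """Bytes in one frame of uncompressed PCM (linear16 uses 2 bytes per sample)."""
    return sample_rate * channels * bytes_per_sample * frame_ms // 1000


def split_frames(pcm: bytes, size: int):
    """Yield consecutive slices of `size` bytes; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

Each slice would be sent as one binary frame, paced at roughly frame_ms apart to keep the cadence steady rather than bursty.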
Client commands in detail
speak
- flush=true clears queued audio before speaking; set false to build playlists.
- allow_interruption=false protects critical prompts even if the client sends clear.
- tts_config may override any session default, including provider credentials via the auth field.
- Finish signal arrives via tts_playback_complete with the same id.
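A speak request and its completion check might look like the sketch below. The flush, allow_interruption, tts_config, and id fields are described above; the "text" field name and the default values chosen here are assumptions, and both helpers are illustrative.

```python
import uuid


def build_speak(text, flush=True, allow_interruption=True, tts_overrides=None):
    """Build a speak message with a fresh id for correlation."""
    message = {
        "type": "speak",
        "text": text,                     # assumed field name for the text to synthesize
        "id": str(uuid.uuid4()),
        "flush": flush,                   # False appends to the queue (playlist)
        "allow_interruption": allow_interruption,
    }
    if tts_overrides:
        message["tts_config"] = tts_overrides   # per-call overrides of session defaults
    return message


def is_completion_for(event, speak_message):
    """True when a server event reports that this speak request finished."""
    return (event.get("type") == "tts_playback_complete"
            and event.get("id") == speak_message["id"])
```

Keeping the generated id alongside the request lets the client match tts_playback_complete events to individual prompts when several are queued.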
loading_start
Begins looping the configured loading_audio clip on the dedicated "loading-audio" LiveKit track. Fire-and-forget; idempotent if the loop is already running. Requires audio=true, an active LiveKit room, and a loading_audio clip supplied in the initial config message. Failures (no config yet, audio disabled, no LiveKit room, missing or undecodable loading_audio, track publish failure) surface as error messages on the existing error channel.
loading_stop
Stops the loading loop with a short fade-out; idempotent and silent server-side. Because speak does not stop the loop, applications must send loading_stop before (or together with) speak in the normal “thinking finished, now answer” flow. Otherwise the loading sound continues playing under the spoken answer.
send_message
Publishes JSON payloads over the LiveKit data channel. Requires a livekit block in the configuration message.
Server events in detail
ready
Echoes the session stream_id (provided or auto-generated), LiveKit metadata (room name, URLs, participant identity), and signals that STT/TTS providers finished booting. Use that stream_id later with GET /recording/{stream_id} when recording is enabled. Other participants still need to call POST /livekit/token to join the room.
stt_result
- Interim updates set is_final=false and keep streaming as long as audio flows.
- The final chunk for an utterance flips is_final=true.
- is_speech_final=true indicates the turn detector heard silence, which is the ideal moment to trigger a response.
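One way to act on these flags is to accumulate final transcripts and respond only once is_speech_final arrives. The sketch below assumes the transcript text lives in a field named "transcript", which this page does not confirm; is_final and is_speech_final are as described above.

```python
def collect_turn(events):
    """Join final transcripts until is_speech_final; return (text, turn_done)."""
    parts = []
    for event in events:
        if event.get("type") != "stt_result":
            continue                                   # ignore other server events
        if event.get("is_final"):
            parts.append(event.get("transcript", ""))  # assumed field name
        if event.get("is_speech_final"):
            return " ".join(p for p in parts if p), True
    return " ".join(p for p in parts if p), False
```

Interim results are skipped here on purpose; a UI that shows live captions would render them separately instead of feeding them into the response logic.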
message and participant_disconnected
Utility events for LiveKit mirroring. When a participant_disconnected event arrives, expect Sayna to close the WebSocket 100 ms later so the LiveKit room is torn down cleanly.
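Since JSON events and binary TTS audio share one socket, a receive loop typically routes each frame by its Python type. The following is a minimal sketch that assumes the WebSocket library delivers text frames as str and binary frames as bytes (as many Python clients do); route_frame is an illustrative helper, not a Sayna API.

```python
import json


def route_frame(frame, audio_out: bytearray, events: list) -> bool:
    """Route one received frame; return False once the session is ending."""
    if isinstance(frame, (bytes, bytearray)):
        audio_out.extend(frame)        # binary frame: synthesized TTS audio
        return True
    event = json.loads(frame)          # text frame: JSON control event
    events.append(event)
    # Sayna closes the socket about 100 ms after this event, so stop sending.
    return event.get("type") != "participant_disconnected"
```

A caller would invoke this for every received frame and leave its send loop as soon as it returns False.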
Binary TTS payloads
Binary frames deliver synthesized audio in the format you declared (e.g., 24 kHz linear16). Handle them alongside JSON control frames.
Provider auth overrides
Both stt_config and tts_config accept an optional auth object that lets clients supply provider credentials per WebSocket session or per POST /speak request. This is useful for multi-tenant deployments, customer-managed provider keys, and per-request credential routing.
Fallback behavior
- If auth is omitted, the server uses its configured provider credentials.
- If auth is sent as an empty object {}, the server also falls back to its configured credentials.
No partial merge
If auth is provided, it must be complete for that provider. The server does not merge a partial auth object with server-side credentials. For example, sending Azure auth with only api_key and no region is invalid.
Supported auth shapes
| Provider | Shape | Required fields |
|---|---|---|
| Deepgram | { "api_key": "..." } | api_key |
| ElevenLabs | { "api_key": "..." } | api_key |
| Cartesia | { "api_key": "..." } | api_key |
| Google Cloud | { "credentials": "..." } | credentials (file path string or JSON service account object) |
| Azure Speech | { "api_key": "...", "region": "..." } | api_key, region |
Validation
Clients should expect session or request initialization to fail if:
- Required auth fields for the selected provider are missing.
- Auth field types are invalid.
- Unsupported auth keys are sent for that provider.
- The provider name is unsupported.
The provider value in stt_config.provider or tts_config.provider determines how the auth object is interpreted; the auth object itself does not include a provider discriminator.
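The validation rules above can be mirrored client-side to catch mistakes before the config message is sent. This is a sketch under assumptions: the required fields come from the Supported auth shapes table, but the lowercase provider identifiers used as keys are guesses, and the server's actual checks and error wording may differ.

```python
# Required auth fields per provider, taken from the table; the keys are assumed identifiers.
REQUIRED_AUTH_FIELDS = {
    "deepgram": {"api_key"},
    "elevenlabs": {"api_key"},
    "cartesia": {"api_key"},
    "google": {"credentials"},
    "azure": {"api_key", "region"},
}


def validate_auth(provider, auth):
    """Return a list of problems; an empty list means the auth looks usable."""
    if provider not in REQUIRED_AUTH_FIELDS:
        return [f"unsupported provider: {provider}"]
    if not auth:                       # omitted or {}: server defaults apply
        return []
    required = REQUIRED_AUTH_FIELDS[provider]
    keys = set(auth)
    problems = [f"missing field: {name}" for name in sorted(required - keys)]
    problems += [f"unsupported key: {name}" for name in sorted(keys - required)]
    for name in sorted(required & keys):
        value = auth[name]
        # Google credentials may be a path string or a service account object.
        ok = isinstance(value, str) or (name == "credentials" and isinstance(value, dict))
        if not ok:
            problems.append(f"invalid type for {name}")
    return problems
```

For example, Azure auth with only api_key reports a missing region, which matches the no-partial-merge rule.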
Loading indicator
The loading indicator is a short audio clip looped into the LiveKit room while your application is busy (“thinking”) and cannot yet answer. It is the audio equivalent of a spinner. The clip plays on its own dedicated LiveKit audio track named "loading-audio", separate from the "tts-audio" speech track, so it never interferes with text-to-speech output.
loading_audio configuration
Include a loading_audio object in the initial config message. The server decodes and validates the clip once at config time and holds the validated buffer for the session.
| Field | Type | Required | Description |
|---|---|---|---|
| data | string | Yes | Base64-encoded audio bytes: a complete WAV file or raw 16-bit little-endian PCM. |
| format | "wav" or "pcm" | No | Container hint. If omitted, the server auto-detects from the RIFF/WAVE signature. |
| sample_rate | integer | Conditional | Sample rate in Hz (8000–48000). Required for raw PCM; ignored for WAV (the header is authoritative). |
| channels | integer | No | 1 (mono) or 2 (stereo). Defaults to 1. Ignored for WAV. |
| volume | number | No | Playback volume 0.0–1.0. Default 1.0; out-of-range values are clamped; applied once at config time. |
- Only 16-bit PCM is supported (16-bit integer WAV, or raw 16-bit little-endian PCM). Other bit depths are rejected.
- Decoded duration: ~250 ms to ~10 s, 1–4 s recommended.
- Decoded size is capped at ~1.9 MB.
- Author the clip so its start and end amplitudes match — otherwise the loop seam will click.
- Author the clip quiet enough to sit under speech rather than compete with it.
- loading_audio is processed only when audio=true and a livekit block is present in the same config message. Supplied without those, the server emits an error and continues the session without loading capability.
- Decode or validation failures are non-fatal: the session continues, STT/TTS are unaffected, and only the loading feature is unavailable. A later loading_start re-reports the original decode error so the failure is never silent.
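The field table and constraints above can be turned into a small client-side builder for raw PCM clips. This is a sketch only: the duration bounds are the approximate figures quoted above, the size cap is not checked, and the server performs its own validation regardless.

```python
import base64

MIN_SECONDS, MAX_SECONDS = 0.25, 10.0   # approximate bounds quoted in the text


def build_loading_audio(pcm: bytes, sample_rate: int, channels: int = 1, volume: float = 1.0):
    """Wrap raw 16-bit little-endian PCM as a loading_audio object."""
    if not 8000 <= sample_rate <= 48000:
        raise ValueError("sample_rate must be 8000-48000 Hz")
    if channels not in (1, 2):
        raise ValueError("channels must be 1 or 2")
    seconds = len(pcm) / (sample_rate * channels * 2)   # 2 bytes per 16-bit sample
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(f"clip is {seconds:.2f} s; expected roughly 0.25-10 s")
    return {
        "data": base64.b64encode(pcm).decode("ascii"),
        "format": "pcm",
        "sample_rate": sample_rate,     # required for raw PCM
        "channels": channels,
        "volume": min(1.0, max(0.0, volume)),   # clamp locally; the server clamps too
    }
```

The returned object would be placed under the loading_audio key of the initial config message.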
Controlling the loop
The loop is controlled exclusively by the loading_start and loading_stop commands. speak and clear do not affect it. Server-side, the loop also stops automatically on session teardown / disconnect.
The server does not stop the loop when VAD detects user speech. If you want the loading audio to end on barge-in, send loading_stop from your vad_event or stt_result handler.
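Ending the loop on barge-in could be handled in the client's event handler, as sketched below. The vad_event payload shape is not documented on this page, so the "event" field and its "speech_start" value are assumptions; on_server_event and the loading_active flag are illustrative client-side state.

```python
def on_server_event(event, send, loading_active):
    """Return the new loading state after reacting to one server event."""
    user_is_speaking = (
        event.get("type") == "vad_event"
        and event.get("event") == "speech_start"    # assumed payload shape
    ) or event.get("type") == "stt_result"
    if loading_active and user_is_speaking:
        send({"type": "loading_stop"})              # idempotent server-side
        return False
    return loading_active
```

Tracking loading_active locally avoids sending loading_stop on every transcript, although repeated sends would be harmless.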
Typical turn
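A turn built on the commands above usually follows the order loading_start, application work, loading_stop, speak. The sketch below illustrates that ordering only; think is a hypothetical application callback, send stands for whatever function writes a JSON message to the socket, and the "text" field name in speak is an assumption.

```python
def answer_turn(send, think, user_text):
    """Run one turn: start the loop, compute a reply, stop the loop, then speak."""
    send({"type": "loading_start"})
    try:
        reply = think(user_text)             # hypothetical application callback
    finally:
        send({"type": "loading_stop"})       # idempotent, so always safe to send
    send({"type": "speak", "text": reply})   # "text" is an assumed field name
```

Sending loading_stop in a finally block keeps the loop from playing on if the application callback raises.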
LiveKit interplay
- Sayna manages the agent-side participant automatically; humans still fetch tokens via POST /livekit/token.
- listen_participants limits which identities can feed audio back into Sayna.
- Recordings respect your LiveKit/S3 configuration when enable_recording=true; files are stored at {server_prefix}/{stream_id}/audio.ogg using the stream_id from your config or the ready message.
- send_message and clear affect both the WebSocket client and LiveKit room so playback remains in sync.
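The recording location follows directly from the stream_id. A minimal sketch of both lookups is shown below: the storage key pattern and the GET /recording/{stream_id} route come from this page, while the helper names and the base URL are illustrative.

```python
from urllib.parse import quote


def recording_object_key(server_prefix, stream_id):
    """Storage key of a session recording: {server_prefix}/{stream_id}/audio.ogg."""
    return f"{server_prefix.rstrip('/')}/{stream_id}/audio.ogg"


def recording_url(base_url, stream_id):
    """URL for GET /recording/{stream_id}; the stream_id is percent-encoded."""
    return f"{base_url.rstrip('/')}/recording/{quote(stream_id, safe='')}"
```

Storing the stream_id from the ready message at session start makes both lookups possible after the socket has closed.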
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| error: STT and TTS configurations required when audio is enabled | audio=true but one of the configs is missing. | Provide both stt_config and tts_config, or set audio=false. |
| WebSocket closes right after participant_disconnected | LiveKit peer left. | Reconnect / reconfigure to start a new session. |
| Garbled transcripts | Audio frame format doesn’t match stt_config. | Ensure sample rate, channels, and encoding align with captured audio. |
| LiveKit data messages rejected | send_message used without livekit section. | Include LiveKit config in the initial config payload. |
| Loading audio plays under the spoken answer | speak was sent before loading_stop. | Send loading_stop immediately before (or together with) speak. |