The `/ws` endpoint is the control plane for every real-time experience: a single socket carries streaming audio, transcription events, TTS playback, and LiveKit data. This page distills the full protocol from the engineering brief (tmp/docs/websocket.mdoc) into actionable reference material.
End-to-end flow

1. **Handshake.** Open a WS connection (`ws://host:3001/ws` or `wss://`). Ping/pong is automatic; idle sockets close after ~10 s of silence.
2. **Config.** The very first client message must be `type: "config"`. Sayna injects credentials, boots STT/TTS providers, provisions LiveKit (if configured), warms caches, and responds with `ready` once every subsystem is up (30 s timeout).
3. **Active session.** Stream binary PCM/Opus frames, queue `speak` jobs, forward LiveKit data via `send_message`, and react to `stt_result`, `message`, `participant_disconnected`, and binary TTS audio frames.
4. **Cleanup.** On close (client-initiated, timeout, or LiveKit departure) Sayna halts providers, stops recordings, leaves/deletes the room, flushes caches, and releases file handles.
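The ordering constraints above (the config message must come first; streaming is only safe after `ready`) can be enforced with a small client-side state machine. This is an illustrative sketch, not part of the protocol; the class and state names are inventions of this example:

```python
from enum import Enum, auto

class SessionState(Enum):
    CONNECTED = auto()    # socket open, nothing sent yet
    CONFIGURING = auto()  # config sent, waiting for ready (30 s timeout)
    ACTIVE = auto()       # ready received; audio, speak, clear all allowed
    CLOSED = auto()       # server closed (e.g. after participant_disconnected)

class Session:
    def __init__(self):
        self.state = SessionState.CONNECTED

    def send(self, msg_type: str) -> None:
        """Validate that a client message is legal in the current state."""
        if self.state is SessionState.CONNECTED:
            if msg_type != "config":
                raise RuntimeError("first client message must be config")
            self.state = SessionState.CONFIGURING
        elif self.state is SessionState.CONFIGURING:
            raise RuntimeError("wait for ready before sending anything else")
        elif self.state is SessionState.CLOSED:
            raise RuntimeError("socket already closed")
        # ACTIVE: audio frames, speak, clear, send_message are all fine

    def on_server_event(self, msg_type: str) -> None:
        if msg_type == "ready":
            self.state = SessionState.ACTIVE
        elif msg_type == "participant_disconnected":
            # Sayna closes the socket ~100 ms after this event
            self.state = SessionState.CLOSED
```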
Message taxonomy
| Direction | Type | Purpose |
|---|---|---|
| Client → Server | `config` | Declare STT/TTS/LiveKit setup; must come first. |
| Client → Server | Binary audio | Raw PCM/Opus chunks that match `stt_config`. |
| Client → Server | `speak` | Queue TTS synthesis (optionally override defaults per call). |
| Client → Server | `clear` | Flush TTS buffers (WebSocket + LiveKit) unless playback is locked. |
| Client → Server | `send_message` | Publish JSON payloads over the LiveKit data channel. |
| Server → Client | `ready` | Providers + LiveKit are initialized; safe to stream. |
| Server → Client | `stt_result` | Interim/final transcripts with `is_final` / `is_speech_final`. |
| Server → Client | Binary audio | TTS playback chunks in the negotiated format. |
| Server → Client | `message` | Informational notices (LiveKit state, provider hints, etc.). |
| Server → Client | `tts_playback_complete` | Signals a `speak` request finished emitting audio. |
| Server → Client | `participant_disconnected` | LiveKit peer left; Sayna closes the socket after 100 ms. |
| Server → Client | `error` | Validation/provider issues; connection usually stays open so you can retry. |
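Because one socket multiplexes JSON control frames and binary audio, a client's receive loop usually branches on frame type first, then on the `type` field. A minimal sketch of such a dispatcher (the function name and return labels are illustrative):

```python
import json

def dispatch(frame) -> str:
    """Classify an incoming WebSocket frame from the /ws endpoint."""
    if isinstance(frame, (bytes, bytearray)):
        return "tts_audio"  # binary frames are TTS playback chunks
    msg = json.loads(frame)
    kind = msg.get("type")
    if kind in {"ready", "stt_result", "message",
                "tts_playback_complete", "participant_disconnected", "error"}:
        return kind
    raise ValueError(f"unexpected server message type: {kind!r}")
```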
Configuration message
- `audio=false` skips STT/TTS initialization but still allows LiveKit control + data-channel relays.
- `stream_id` is optional; provide one to control recording paths and correlate sessions, or let Sayna generate a UUID v4 and return it in the `ready` message.
- `stt_config.sample_rate` / `channels` / `encoding` must exactly match the binary frames you send; Sayna will not resample.
- `tts_config` values become session defaults; individual `speak` messages can override most fields or add `id` for correlation.
- The `livekit` block is optional but required for `send_message` and mirrored playback/recording.
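The options above can be combined into the mandatory first message. The sketch below assumes a plausible field nesting; the authoritative schema is in the engineering brief, and the contents of the `livekit` block here (a `room` key) are hypothetical:

```python
import json

def build_config(stream_id=None, with_livekit=False):
    """Assemble the mandatory first client message (illustrative nesting)."""
    msg = {
        "type": "config",
        "audio": True,  # set False to skip STT/TTS and keep LiveKit control only
        "stt_config": {
            # Must exactly match the binary frames you stream; no resampling.
            "sample_rate": 16000,
            "channels": 1,
            "encoding": "linear16",
        },
        "tts_config": {
            # Session defaults; individual speak messages may override.
            "sample_rate": 24000,
            "encoding": "linear16",
        },
    }
    if stream_id is not None:
        msg["stream_id"] = stream_id  # otherwise Sayna generates a UUID v4
    if with_livekit:
        msg["livekit"] = {"room": "demo-room"}  # hypothetical block contents
    return json.dumps(msg)
```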
Streaming audio best practices
- Chunk size: ~100–200 ms of audio per binary frame (e.g., 3,200 bytes for 100 ms of 16 kHz mono linear16).
- Maintain a consistent cadence; bursty uploads increase STT latency.
- Enable `punctuation` + `turn-detect` (default features) when you need `is_speech_final` cues to know when a speaker stopped.
- If the client detects a barge-in (a human interrupting the AI), send `{"type":"clear"}` before queueing new `speak` text.
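The chunk-size guidance above follows directly from the PCM math (linear16 means 2 bytes per sample):

```python
def chunk_bytes(ms, sample_rate=16000, channels=1, bytes_per_sample=2):
    """Bytes per binary frame of raw PCM for a given frame duration."""
    return sample_rate * channels * bytes_per_sample * ms // 1000

# 100 ms of 16 kHz mono linear16 -> 3200 bytes, matching the guidance above.
```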
Client commands in detail
speak
- `flush=true` clears queued audio before speaking; set `false` to build playlists.
- `allow_interruption=false` protects critical prompts even if the client sends `clear`.
- `tts_config` may override any default except provider credentials.
- The finish signal arrives via `tts_playback_complete` with the same `id`.
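A hedged sketch of queueing a protected prompt and correlating its completion event. `flush`, `allow_interruption`, `tts_config`, and `id` come from the list above; the `text` field name is an assumption:

```python
import json
import uuid

def build_speak(text, *, flush=True, allow_interruption=True, tts_overrides=None):
    """Queue a TTS job; returns (wire message, id) so the caller can match
    the eventual tts_playback_complete event."""
    job_id = str(uuid.uuid4())
    msg = {
        "type": "speak",
        "id": job_id,
        "text": text,                              # field name assumed
        "flush": flush,                            # False appends to the playlist
        "allow_interruption": allow_interruption,  # False survives clear
    }
    if tts_overrides:
        msg["tts_config"] = tts_overrides  # cannot override provider credentials
    return json.dumps(msg), job_id

def is_done(event, job_id):
    """True when the server reports this speak job finished emitting audio."""
    return event.get("type") == "tts_playback_complete" and event.get("id") == job_id
```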
send_message
Publishes arbitrary JSON payloads over the LiveKit data channel. Requires the `livekit` block in the configuration message.
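A minimal sketch of wrapping a payload for the data channel; the name of the payload field (`message`) is an assumption, only the `type` value is confirmed by the taxonomy above:

```python
import json

def build_send_message(payload):
    """Wrap an arbitrary JSON-serializable payload for the LiveKit data channel."""
    return json.dumps({"type": "send_message", "message": payload})
```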
Server events in detail
ready
Echoes the session `stream_id` (provided or auto-generated) and LiveKit metadata (room name, URLs, participant identity), and signals that the STT/TTS providers finished booting. Use that `stream_id` later with `GET /recording/{stream_id}` when recording is enabled. Other participants still need to call `POST /livekit/token` to join the room.
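For example, capturing the `stream_id` from `ready` to build the recording fetch path mentioned above (the event field name is assumed to match the config key):

```python
def on_ready(event):
    """Extract the session stream_id from ready and derive the recording path."""
    stream_id = event["stream_id"]  # field name assumed
    return f"/recording/{stream_id}"
```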
stt_result
- Interim updates set `is_final=false` and keep streaming as long as audio flows.
- The final chunk for an utterance flips `is_final=true`.
- `is_speech_final=true` indicates the turn detector heard silence; it is the ideal trigger for a response.
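The two flags above combine into a simple gate for deciding when the assistant should reply; a minimal sketch using the field names from the event description:

```python
def should_respond(result):
    """True only when the utterance is final AND the turn detector heard silence."""
    return bool(result.get("is_final")) and bool(result.get("is_speech_final"))
```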
message and participant_disconnected
Utility events for LiveKit mirroring: `message` carries informational notices (LiveKit state changes, provider hints), and `participant_disconnected` reports that a LiveKit peer left. When a `participant_disconnected` event arrives, expect Sayna to close the WebSocket 100 ms later so the LiveKit room is torn down cleanly.
Binary TTS payloads
Binary frames deliver synthesized audio in the format you declared (e.g., 24 kHz linear16). Handle them alongside JSON control frames.

LiveKit interplay
- Sayna manages the agent-side participant automatically; humans still fetch tokens via `POST /livekit/token`.
- `listen_participants` limits which identities can feed audio back into Sayna.
- Recordings respect your LiveKit/S3 configuration when `enable_recording=true`; files are stored at `{server_prefix}/{stream_id}/audio.ogg` using the `stream_id` from your config or the `ready` message.
- `send_message` and `clear` affect both the WebSocket client and the LiveKit room so playback remains in sync.
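The recording layout above is mechanical enough to compute client-side, e.g. when locating objects in your S3 bucket:

```python
def recording_path(server_prefix, stream_id):
    """Object key for a session recording when enable_recording=true."""
    return f"{server_prefix}/{stream_id}/audio.ogg"
```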
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| `error: STT and TTS configurations required when audio is enabled` | `audio=true` but one of the configs is missing. | Provide both `stt_config` and `tts_config`, or set `audio=false`. |
| WebSocket closes right after `participant_disconnected` | LiveKit peer left. | Reconnect / reconfigure to start a new session. |
| Garbled transcripts | Audio frame format doesn’t match `stt_config`. | Ensure sample rate, channels, and encoding align with captured audio. |
| LiveKit data messages rejected | `send_message` used without the `livekit` section. | Include LiveKit config in the initial `config` payload. |