WebSocket API

Sayna’s /ws endpoint is the control plane for every real-time experience. A single socket handles streaming audio, transcription events, TTS playback, and LiveKit data. This page distills the full protocol from the engineering brief (tmp/docs/websocket.mdoc) into actionable reference material.

End-to-end flow

Client ──► Connect /ws
          └─► send {"type":"config", ...}
Sayna  ──► validate + spin up providers + optional LiveKit join
Sayna  ──► send {"type":"ready", ...}
Client ──► start streaming audio / send commands
Client & Sayna ── keep exchanging JSON control frames + binary audio
Either side closes socket → Sayna tears down providers + LiveKit room

Handshake

Open a WebSocket connection to ws://host:3001/ws (or the wss:// equivalent over TLS). Ping/pong is automatic; idle sockets close after ~10 s of silence.

Config

The very first client message must be type: "config". Sayna injects credentials, boots STT/TTS providers, provisions LiveKit (if configured), warms caches, and responds with ready once every subsystem is up (30 s timeout).

Active session

Stream binary PCM/Opus frames, queue speak jobs, forward LiveKit data via send_message, and react to stt_result, message, participant_disconnected, and binary TTS audio frames.

Cleanup

On close (client, timeout, or LiveKit departure) Sayna halts providers, stops recordings, leaves/deletes the room, flushes caches, and releases file handles.
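The four phases above can be mirrored on the client with a small state tracker. The following is only a sketch: `createSessionTracker` and its phase names are hypothetical and not part of Sayna's API. It enforces the two ordering rules stated on this page (config goes first, streaming starts after ready); refusing all traffic while waiting for ready is a conservative choice made here, not a documented server rule.

```javascript
// Hypothetical client-side tracker for the lifecycle described above.
// Phase names are illustrative; Sayna does not define them.
function createSessionTracker() {
  let phase = "connected"; // socket is open, nothing sent yet

  return {
    get phase() {
      return phase;
    },

    // Call before sending any frame; throws if the frame is out of order.
    beforeSend(frame) {
      const isBinary = frame instanceof ArrayBuffer || ArrayBuffer.isView(frame);
      const type = isBinary ? "binary" : frame.type;
      if (phase === "closed") throw new Error("socket is closed");
      if (phase === "connected") {
        if (type !== "config") throw new Error("first message must be config");
        phase = "configuring";
        return;
      }
      if (phase === "configuring") {
        throw new Error(`wait for ready before sending ${type}`);
      }
    },

    // Call for every parsed JSON control frame received from the server.
    onServerEvent(event) {
      if (event.type === "ready" && phase === "configuring") phase = "ready";
    },

    // Call from the socket's close handler.
    onClose() {
      phase = "closed";
    },
  };
}
```

A client would call `beforeSend` just ahead of each `socket.send(...)` and `onServerEvent` from its message handler, so ordering mistakes surface locally rather than as server-side errors.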

Message taxonomy

Direction	Type	Purpose
Client → Server	`config`	Declare STT/TTS/livekit setup; must come first.
Client → Server	Binary audio	Raw PCM/Opus chunks that match `stt_config`.
Client → Server	`speak`	Queue TTS synthesis (optionally override defaults per call).
Client → Server	`clear`	Flush TTS buffers (WebSocket + LiveKit) unless playback is locked.
Client → Server	`send_message`	Publish JSON payloads over the LiveKit data channel.
Server → Client	`ready`	Providers + LiveKit are initialized; safe to stream.
Server → Client	`stt_result`	Interim/final transcripts with `is_final` / `is_speech_final`.
Server → Client	Binary audio	TTS playback chunks in the negotiated format.
Server → Client	`message`	Informational notices (LiveKit state, provider hints, etc.).
Server → Client	`tts_playback_complete`	Signals a `speak` request finished emitting audio.
Server → Client	`participant_disconnected`	LiveKit peer left → Sayna closes the socket after 100 ms.
Server → Client	`error`	Validation/provider issues; connection usually stays open so you can retry.

Configuration message

{
  "type": "config",
  "stream_id": "support-call-123",
  "audio": true,
  "stt_config": {
    "provider": "deepgram",
    "language": "en-US",
    "sample_rate": 16000,
    "channels": 1,
    "punctuation": true,
    "encoding": "linear16",
    "model": "nova-2"
  },
  "tts_config": {
    "provider": "deepgram",
    "model": "aura-asteria-en",
    "voice_id": "aura-asteria-en",
    "speaking_rate": 1.0,
    "audio_format": "linear16",
    "sample_rate": 24000
  },
  "livekit": {
    "room_name": "conversation-room-123",
    "enable_recording": true,
    "sayna_participant_identity": "sayna-ai",
    "sayna_participant_name": "Sayna AI",
    "listen_participants": []
  }
}

audio=false skips STT/TTS initialization but still allows LiveKit control + data channel relays.
stream_id is optional; provide one to control recording paths and correlate sessions, or let Sayna generate a UUID v4 and return it in the ready message.
stt_config.sample_rate/channels/encoding must exactly match the binary frames you send—Sayna will not resample.
tts_config values become session defaults; individual speak messages can override most fields or add id for correlation.
The livekit block is optional, but send_message and mirrored playback/recording only work when it is present.
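The rules in this list can be checked on the client before the config frame is sent. The sketch below is illustrative only: `checkConfig`, `expectedFrameFormat`, and `canSendMessage` are hypothetical helper names, and treating a missing `audio` field as enabled is an assumption, since this page does not state the default.

```javascript
// Hypothetical pre-flight checks mirroring the configuration rules above.
// Returns a list of problems; an empty list means the config looks sendable.
function checkConfig(config) {
  const problems = [];
  if (config.type !== "config") problems.push('type must be "config"');

  // Assumption: a missing `audio` field is treated as enabled.
  const audioEnabled = config.audio !== false;
  if (audioEnabled && (!config.stt_config || !config.tts_config)) {
    problems.push("stt_config and tts_config are both required when audio is enabled");
  }
  return problems;
}

// The binary frames you send must match these three fields exactly,
// because Sayna does not resample.
function expectedFrameFormat(config) {
  const { sample_rate, channels, encoding } = config.stt_config;
  return { sample_rate, channels, encoding };
}

// send_message only works when the config carried a livekit block.
function canSendMessage(config) {
  return Boolean(config.livekit);
}
```

Running `checkConfig` before the first `socket.send(JSON.stringify(config))` catches the most common mistake locally instead of waiting for an `error` event.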

Streaming audio best practices

Chunk size: ~100–200 ms of audio per binary frame (e.g., 3,200 bytes per 100 ms of 16 kHz mono linear16).
Maintain consistent cadence; bursty uploads increase STT latency.
Enable punctuation + turn-detect (default features) when you need is_speech_final cues to know when a speaker stopped.
If the client detects a barge-in (human interrupts AI), send {"type":"clear"} before queueing new speak text.
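The chunk-size guidance reduces to simple arithmetic: sample rate × duration × channels × 2 bytes for linear16. A minimal sketch follows, with hypothetical helper names (`linear16ChunkBytes`, `splitIntoFrames`); it covers linear16 only, since Opus frames are not a fixed byte size.

```javascript
// Bytes needed for `ms` milliseconds of linear16 (signed 16-bit PCM) audio.
function linear16ChunkBytes(sampleRate, channels, ms) {
  const bytesPerSample = 2; // linear16 stores 16 bits per sample
  const samplesPerChannel = Math.round((sampleRate * ms) / 1000);
  return samplesPerChannel * channels * bytesPerSample;
}

// Split a PCM buffer (ArrayBuffer or Uint8Array) into fixed-size frames.
// The last frame may be shorter than frameBytes.
function splitIntoFrames(pcm, frameBytes) {
  const frames = [];
  for (let offset = 0; offset < pcm.byteLength; offset += frameBytes) {
    frames.push(pcm.slice(offset, offset + frameBytes));
  }
  return frames;
}
```

For 16 kHz mono audio, `linear16ChunkBytes(16000, 1, 100)` gives 3200 and `linear16ChunkBytes(16000, 1, 200)` gives 6400. When streaming pre-recorded audio, sending one frame per timer tick at the matching interval keeps the cadence steady rather than bursty.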

Client commands in detail

`speak`

{
  "type": "speak",
  "text": "Hello! How can I help you today?",
  "flush": true,
  "allow_interruption": true,
  "tts_config": {"voice_id": "aura-luna-en"},
  "id": "greeting-1"
}

flush=true clears queued audio before speaking; set false to build playlists.
allow_interruption=false protects critical prompts even if the client sends clear.
tts_config may override any default except provider credentials.
Finish signal arrives via tts_playback_complete with the same id.
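Because completion is reported by id, a client may want to remember which speak requests are still playing. The sketch below shows one way to do that. `createSpeakTracker` is a hypothetical helper, and it assumes the `tts_playback_complete` event exposes the id in a field named `id`; this page only says the event carries "the same id".

```javascript
// Hypothetical helper that pairs speak requests with tts_playback_complete events.
function createSpeakTracker() {
  const pending = new Map(); // id -> text that has not finished playing
  let counter = 0;

  return {
    // Build a speak message; the caller sends JSON.stringify(result) on the socket.
    // `options` may carry flush, allow_interruption, tts_config, or a custom id.
    speak(text, options = {}) {
      const id = options.id ?? `speak-${++counter}`;
      pending.set(id, text);
      return { type: "speak", text, ...options, id };
    },

    // Feed every parsed JSON server event in; returns the finished text, if any.
    // Assumption: the completion event carries the id in an `id` field.
    onServerEvent(event) {
      if (event.type !== "tts_playback_complete" || !pending.has(event.id)) {
        return undefined;
      }
      const text = pending.get(event.id);
      pending.delete(event.id);
      return text;
    },

    // Ids of speak requests that have not completed yet.
    pendingIds() {
      return [...pending.keys()];
    },
  };
}
```

Fields left out of `options` are omitted from the message, so the session defaults from the config apply.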

`send_message`

{
  "type": "send_message",
  "message": "{\"action\":\"typing\"}",
  "role": "assistant",
  "topic": "status",
  "debug": {"source": "sayna-ai", "version": "1.0"}
}

Used when Sayna should notify LiveKit participants via data channels (chat bubbles, control events). Requires a livekit block in the configuration message.
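In the example above, `message` is a string that itself contains JSON, which is easy to get wrong by hand. A small builder can take care of the serialization. This is a sketch under assumptions: `buildSendMessage` is a hypothetical name, the `"assistant"` default for `role` is taken from the example rather than from a documented default, and this page does not say which fields are mandatory.

```javascript
// Build a send_message frame. The documented example carries `message` as a
// string, so structured payloads are serialized before being embedded.
function buildSendMessage(payload, { role = "assistant", topic, debug } = {}) {
  const frame = {
    type: "send_message",
    message: typeof payload === "string" ? payload : JSON.stringify(payload),
    role, // assumption: defaulting to the role used in the example above
  };
  if (topic !== undefined) frame.topic = topic;
  if (debug !== undefined) frame.debug = debug;
  return frame;
}
```

For instance, `buildSendMessage({ action: "typing" }, { topic: "status" })` yields a frame whose `message` is the string `{"action":"typing"}`; the caller then sends `JSON.stringify(frame)` over the socket.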

Server events in detail

`ready`

Echoes the session stream_id (provided or auto-generated), LiveKit metadata (room name, URLs, participant identity), and signals that STT/TTS providers finished booting. Use that stream_id later with GET /recording/{stream_id} when recording is enabled. Other participants still need to call POST /livekit/token to join the room.
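As an illustration of carrying the stream_id forward, the sketch below derives the recording URL from a parsed ready event. The helper names are hypothetical, `baseUrl` stands in for your own deployment's HTTP origin, and the `stream_id` field name on the ready event is assumed from the description above rather than from a documented schema.

```javascript
// Extract the session id from a parsed ready event.
// Assumption: the id is exposed as `stream_id`, matching the config field name.
function streamIdFromReady(event) {
  if (event.type !== "ready") throw new Error("not a ready event");
  return event.stream_id;
}

// Build the URL for GET /recording/{stream_id}.
// `baseUrl` is a placeholder for your deployment, e.g. "http://localhost:3001".
function recordingUrl(baseUrl, streamId) {
  const origin = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${origin}/recording/${encodeURIComponent(streamId)}`;
}
```

Storing the id as soon as ready arrives matters most when Sayna generated it, because that event is the only place the client learns the value.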

`stt_result`

{
  "type": "stt_result",
  "transcript": "Hello, how are you today?",
  "is_final": true,
  "is_speech_final": true,
  "confidence": 0.94
}

Interim updates set is_final=false and keep streaming as long as audio flows.
The final chunk for an utterance flips is_final=true.
is_speech_final=true indicates the turn detector heard silence—ideal for triggering a response.
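These three flags suggest a simple client-side pattern: show interim text as a live preview, keep finalized chunks, and hand the joined turn to your application once is_speech_final arrives. A minimal sketch follows, with a hypothetical `createTranscriptBuffer` helper. Treating an event with is_speech_final=true but is_final=false as finalizing is a choice made here, not documented behaviour.

```javascript
// Accumulates stt_result events and reports one string per completed turn.
function createTranscriptBuffer(onTurn) {
  let finals = []; // finalized chunks of the current turn
  let interim = ""; // latest non-final hypothesis, useful for live captions

  return {
    handle(event) {
      if (event.type !== "stt_result") return;

      // Assumption: a speech-final event also finalizes its own transcript.
      if (event.is_final || event.is_speech_final) {
        if (event.transcript) finals.push(event.transcript);
        interim = "";
      } else {
        interim = event.transcript;
      }

      if (event.is_speech_final) {
        const turn = finals.join(" ").trim();
        finals = [];
        if (turn) onTurn(turn);
      }
    },

    // Text to display right now: finalized chunks plus the current interim guess.
    preview() {
      return [...finals, interim].filter(Boolean).join(" ");
    },
  };
}
```

The `onTurn` callback is a natural place to trigger a response, for example by sending a speak message.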

`message` and `participant_disconnected`

Utility events for LiveKit mirroring:

{"type":"message","label":"livekit.joined","details":{"identity":"user-123"}}
{"type":"participant_disconnected","participant":{"identity":"user-123","name":"Alex"}}

When a participant_disconnected event arrives, expect Sayna to close the WebSocket 100 ms later so the LiveKit room is torn down cleanly.
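Since the close that follows participant_disconnected is intentional, a client may want to tell it apart from an unexpected drop before deciding whether to reconnect. The sketch below is one hypothetical way to do that; `createCloseClassifier` is not part of Sayna, and the `participant.identity` path is taken from the example event above.

```javascript
// Hypothetical helper: remembers whether a LiveKit peer left before the close.
function createCloseClassifier() {
  let peerLeft = null; // identity of the participant that left, if any

  return {
    // Feed every parsed JSON server event in.
    onServerEvent(event) {
      if (event.type === "participant_disconnected") {
        peerLeft = event.participant?.identity ?? "unknown";
      }
    },

    // Call from the socket's close handler.
    describeClose() {
      return peerLeft === null
        ? { expected: false, reason: "closed without a participant_disconnected event" }
        : { expected: true, reason: `LiveKit participant ${peerLeft} left` };
    },
  };
}
```

An application could skip its error banner when `expected` is true and start a fresh session (new socket, new config) only if the user wants to continue.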

Binary TTS payloads

Binary frames deliver synthesized audio in the format you declared (e.g., 24 kHz linear16). Handle them alongside JSON control frames:

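// Browser WebSocket: binary frames arrive as Blob by default (binaryType "blob").
socket.onmessage = (event) => {
  // playAudioChunk and handleControl are application-defined handlers.
  if (event.data instanceof Blob) playAudioChunk(event.data); // TTS audio chunk
  else handleControl(JSON.parse(event.data)); // JSON control frame
};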
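When the negotiated format is linear16, the bytes usually need converting to floating-point samples before a browser audio API can play them. A minimal sketch of that conversion follows. `linear16ToFloat32` is a hypothetical helper, and little-endian byte order is an assumption, since this page does not state the byte order of TTS frames.

```javascript
// Convert a linear16 (signed 16-bit PCM) ArrayBuffer to Float32 samples
// in the range [-1, 1), the representation Web Audio buffers expect.
function linear16ToFloat32(buffer) {
  const view = new DataView(buffer);
  const samples = new Float32Array(Math.floor(buffer.byteLength / 2));
  for (let i = 0; i < samples.length; i++) {
    // Assumption: samples are little-endian (second argument `true`).
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}
```

To receive an ArrayBuffer instead of a Blob in the browser, set `socket.binaryType = "arraybuffer"` before messages arrive. The resulting samples can then be copied into an `AudioBuffer` created at the sample rate declared in tts_config (24000 in the configuration example).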

LiveKit interplay

Sayna manages the agent-side participant automatically; humans still fetch tokens via POST /livekit/token.
listen_participants limits which identities can feed audio back into Sayna.
Recordings respect your LiveKit/S3 configuration when enable_recording=true; files are stored at {server_prefix}/{stream_id}/audio.ogg using the stream_id from your config or the ready message.
send_message and clear affect both the WebSocket client and LiveKit room so playback remains in sync.

Troubleshooting

Symptom	Likely cause	Fix
`error: STT and TTS configurations required when audio is enabled`	`audio=true` but one of the configs missing.	Provide both `stt_config` and `tts_config`, or set `audio=false`.
WebSocket closes right after `participant_disconnected`	LiveKit peer left.	Reconnect / reconfigure to start a new session.
Garbled transcripts	Audio frame format doesn’t match `stt_config`.	Ensure sample rate, channels, and encoding align with captured audio.
LiveKit data messages rejected	`send_message` used without `livekit` section.	Include LiveKit config in the initial `config` payload.

Need more detail on SIP routing or carrier setup? Pair this guide with the SIP configuration and Twilio SIP setup references.
