Sayna’s /ws endpoint is the control plane for every real-time experience. A single socket handles streaming audio, transcription events, TTS playback, and LiveKit data. This page distills the full protocol into actionable reference material.

End-to-end flow

Client ──► Connect /ws
          └─► send {"type":"config", ...}
Sayna  ──► validate + spin up providers + optional LiveKit join
Sayna  ──► send {"type":"ready", ...}
Client ──► start streaming audio / send commands
Client & Sayna ── keep exchanging JSON control frames + binary audio
Either side closes socket → Sayna tears down providers + LiveKit room

1. Handshake

Open a WS connection (ws://host:3001/ws or wss://). Ping/pong is automatic; idle sockets close after ~10 s of silence.
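
A minimal browser-side sketch of this step (the URL is an example; setting binaryType is optional and makes TTS frames arrive as ArrayBuffers instead of Blobs):
const socket = new WebSocket("ws://localhost:3001/ws"); // use wss:// in production
socket.binaryType = "arraybuffer"; // binary TTS frames arrive as ArrayBuffer rather than Blob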

2. Config

The very first client message must be type: "config". Sayna injects credentials, boots STT/TTS providers, provisions LiveKit (if configured), warms caches, and responds with ready once every subsystem is up (30 s timeout).
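
A sketch of that ordering, where config is the JSON object shown under “Configuration message” below and startStreaming is a placeholder for your audio loop:
// The config frame must be the first message on the socket.
socket.onopen = () => socket.send(JSON.stringify(config));

// Mirror Sayna's 30 s boot window on the client side.
const readyTimer = setTimeout(() => socket.close(), 30000);
socket.addEventListener("message", (event) => {
  if (typeof event.data !== "string") return; // binary frames are handled elsewhere
  const msg = JSON.parse(event.data);
  if (msg.type === "ready") {
    clearTimeout(readyTimer);
    startStreaming(); // placeholder: begin sending binary audio
  }
});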

3. Active session

Stream binary PCM/Opus frames, queue speak jobs, forward LiveKit data via send_message, and react to stt_result, message, participant_disconnected, and binary TTS audio frames.

4. Cleanup

On close (client, timeout, or LiveKit departure) Sayna halts providers, stops recordings, leaves/deletes the room, flushes caches, and releases file handles.

Message taxonomy

| Direction | Type | Purpose |
| --- | --- | --- |
| Client → Server | config | Declare STT/TTS/livekit setup; must come first. |
| Client → Server | Binary audio | Raw PCM/Opus chunks that match stt_config. |
| Client → Server | speak | Queue TTS synthesis (optionally override defaults per call). |
| Client → Server | clear | Flush TTS buffers (WebSocket + LiveKit) unless playback is locked. |
| Client → Server | send_message | Publish JSON payloads over the LiveKit data channel. |
| Server → Client | ready | Providers + LiveKit are initialized; safe to stream. |
| Server → Client | stt_result | Interim/final transcripts with is_final / is_speech_final. |
| Server → Client | Binary audio | TTS playback chunks in the negotiated format. |
| Server → Client | message | Informational notices (LiveKit state, provider hints, etc.). |
| Server → Client | tts_playback_complete | Signals a speak request finished emitting audio. |
| Server → Client | participant_disconnected | LiveKit peer left → Sayna closes the socket after 100 ms. |
| Server → Client | error | Validation/provider issues; connection usually stays open so you can retry. |
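
One way to route these frames on the client; the handler names (onReady, onTranscript, and so on) are placeholders, not part of the protocol:
socket.onmessage = (event) => {
  if (typeof event.data !== "string") {
    playAudioChunk(event.data); // binary frame: TTS audio
    return;
  }
  const msg = JSON.parse(event.data);
  switch (msg.type) {
    case "ready": onReady(msg); break;
    case "stt_result": onTranscript(msg); break;
    case "message": onNotice(msg); break;
    case "tts_playback_complete": onSpeakDone(msg); break;
    case "participant_disconnected": onPeerLeft(msg); break;
    case "error": console.error("sayna error:", msg); break;
  }
};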

Configuration message

{
  "type": "config",
  "stream_id": "support-call-123",
  "audio": true,
  "stt_config": {
    "provider": "deepgram",
    "language": "en-US",
    "sample_rate": 16000,
    "channels": 1,
    "punctuation": true,
    "encoding": "linear16",
    "model": "nova-2"
  },
  "tts_config": {
    "provider": "deepgram",
    "model": "aura-asteria-en",
    "voice_id": "aura-asteria-en",
    "speaking_rate": 1.0,
    "audio_format": "linear16",
    "sample_rate": 24000
  },
  "livekit": {
    "room_name": "conversation-room-123",
    "enable_recording": true,
    "sayna_participant_identity": "sayna-ai",
    "sayna_participant_name": "Sayna AI",
    "listen_participants": []
  }
}
  • audio=false skips STT/TTS initialization but still allows LiveKit control + data channel relays (see the minimal sketch after this list).
  • stream_id is optional; provide one to control recording paths and correlate sessions, or let Sayna generate a UUID v4 and return it in the ready message.
  • stt_config.sample_rate/channels/encoding must exactly match the binary frames you send—Sayna will not resample.
  • tts_config values become session defaults; individual speak messages can override most fields or add id for correlation.
  • The livekit block is optional but required for send_message and mirrored playback/recording.
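
As a minimal sketch, a LiveKit-only session (no STT/TTS providers booted) could be configured like this, reusing fields from the full example above:
{
  "type": "config",
  "audio": false,
  "livekit": {
    "room_name": "conversation-room-123",
    "sayna_participant_identity": "sayna-ai",
    "sayna_participant_name": "Sayna AI",
    "listen_participants": []
  }
}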

Streaming audio best practices

  • Chunk size: ~100–200 ms of audio per binary frame (e.g., 3,200 bytes ≈ 100 ms at 16 kHz mono linear16); see the streaming sketch after this list.
  • Maintain consistent cadence; bursty uploads increase STT latency.
  • Enable punctuation + turn-detect (default features) when you need is_speech_final cues to know when a speaker stopped.
  • If the client detects a barge-in (human interrupts AI), send {"type":"clear"} before queueing new speak text.
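
A sketch of steady-cadence streaming, assuming you already capture 16 kHz mono linear16 PCM into a queue of 3,200-byte ArrayBuffers (pcmQueue is illustrative):
// Send one 100 ms frame every 100 ms rather than bursting.
const BYTES_PER_CHUNK = 3200; // 16,000 samples/s × 2 bytes × 0.1 s
setInterval(() => {
  const chunk = pcmQueue.shift(); // illustrative queue of captured PCM frames
  if (chunk && socket.readyState === WebSocket.OPEN) socket.send(chunk);
}, 100);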

Client commands in detail

speak

{
  "type": "speak",
  "text": "Hello! How can I help you today?",
  "flush": true,
  "allow_interruption": true,
  "tts_config": {"voice_id": "aura-luna-en"},
  "id": "greeting-1"
}
  • flush=true clears queued audio before speaking; set false to build playlists.
  • allow_interruption=false protects critical prompts even if the client sends clear.
  • tts_config may override any default except provider credentials.
  • Finish signal arrives via tts_playback_complete with the same id (see the sketch below).
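
For example, a client might build a two-part playlist and match completions by id (onSpeakDone is the same placeholder handler used in the dispatcher above):
// Only the first speak flushes the buffer; the second appends to the queue.
socket.send(JSON.stringify({ type: "speak", text: "One moment.", flush: true, id: "ack-1" }));
socket.send(JSON.stringify({ type: "speak", text: "Here is the answer.", flush: false, id: "answer-1" }));

function onSpeakDone(msg) {
  if (msg.id === "answer-1") {
    // Playlist finished; safe to listen for the user's next turn.
  }
}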

send_message

{
  "type": "send_message",
  "message": "{\"action\":\"typing\"}",
  "role": "assistant",
  "topic": "status",
  "debug": {"source": "sayna-ai", "version": "1.0"}
}
Used when Sayna should notify LiveKit participants via data channels (chat bubbles, control events). Requires a livekit block in the configuration message.

Server events in detail

ready

Echoes the session stream_id (provided or auto-generated), LiveKit metadata (room name, URLs, participant identity), and signals that STT/TTS providers finished booting. Use that stream_id later with GET /recording/{stream_id} when recording is enabled. Other participants still need to call POST /livekit/token to join the room.
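
A sketch of ready handling; stream_id is documented above, while any other field you read off the event should be checked against the actual payload:
let streamId;
function onReady(msg) {
  streamId = msg.stream_id; // provided in config, or a UUID v4 generated by Sayna
  // With enable_recording=true, fetch the file later via GET /recording/{stream_id}.
}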

stt_result

{
  "type": "stt_result",
  "transcript": "Hello, how are you today?",
  "is_final": true,
  "is_speech_final": true,
  "confidence": 0.94
}
  • Interim updates set is_final=false and keep streaming as long as audio flows.
  • The final chunk for an utterance flips is_final=true.
  • is_speech_final=true indicates the turn detector heard silence, the ideal cue for triggering a response (see the sketch below).
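
For example, a voice agent might answer only on is_speech_final (generateReply is a placeholder for your application logic):
function onTranscript(msg) {
  if (!msg.is_final) return; // interim update: refresh a live caption, nothing else
  if (msg.is_speech_final) {
    const reply = generateReply(msg.transcript); // placeholder application logic
    socket.send(JSON.stringify({ type: "speak", text: reply, flush: true }));
  }
}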

message and participant_disconnected

Utility events for LiveKit mirroring:
{"type":"message","label":"livekit.joined","details":{"identity":"user-123"}}
{"type":"participant_disconnected","participant":{"identity":"user-123","name":"Alex"}}
When a participant_disconnected event arrives, expect Sayna to close the WebSocket 100 ms later so the LiveKit room is torn down cleanly.

Binary TTS payloads

Binary frames deliver synthesized audio in the format you declared (e.g., 24 kHz linear16). Handle them alongside JSON control frames:
socket.onmessage = (event) => {
  // Blob is the browser default; if you set socket.binaryType = "arraybuffer",
  // test for ArrayBuffer here instead.
  if (event.data instanceof Blob) playAudioChunk(event.data);
  else handleControl(JSON.parse(event.data));
};

LiveKit interplay

  • Sayna manages the agent-side participant automatically; humans still fetch tokens via POST /livekit/token.
  • listen_participants limits which identities can feed audio back into Sayna.
  • Recordings respect your LiveKit/S3 configuration when enable_recording=true; files are stored at {server_prefix}/{stream_id}/audio.ogg using the stream_id from your config or the ready message.
  • send_message and clear affect both the WebSocket client and LiveKit room so playback remains in sync.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| error: “STT and TTS configurations required when audio is enabled” | audio=true but one of the configs missing. | Provide both stt_config and tts_config, or set audio=false. |
| WebSocket closes right after participant_disconnected | LiveKit peer left. | Reconnect / reconfigure to start a new session. |
| Garbled transcripts | Audio frame format doesn’t match stt_config. | Ensure sample rate, channels, and encoding align with captured audio. |
| LiveKit data messages rejected | send_message used without a livekit section. | Include LiveKit config in the initial config payload. |
Need more detail on SIP routing or carrier setup? Pair this guide with the SIP configuration and Twilio SIP setup references.