Sayna’s /ws endpoint is the control plane for every real-time experience. A single socket handles streaming audio, transcription events, TTS playback, and LiveKit data. This page distills the full protocol from the engineering brief (tmp/docs/websocket.mdoc) into actionable reference material.

End-to-end flow

Client ──► Connect /ws
          └─► send {"type":"config", ...}
Sayna  ──► validate + spin up providers + optional LiveKit join
Sayna  ──► send {"type":"ready", ...}
Client ──► start streaming audio / send commands
Client & Sayna ── keep exchanging JSON control frames + binary audio
Either side closes socket → Sayna tears down providers + LiveKit room

Handshake

Open a WS connection (ws://host:3001/ws or wss://). Ping/pong is automatic; idle sockets close after ~10 s of silence.

Config

The very first client message must be type: "config". Sayna injects credentials, boots STT/TTS providers, provisions LiveKit (if configured), warms caches, and responds with ready once every subsystem is up (30 s timeout).

Active session

Stream binary PCM/Opus frames, queue speak jobs, drive the loading indicator with loading_start / loading_stop, forward LiveKit data via send_message, and react to stt_result, message, participant_disconnected, and binary TTS audio frames.

Cleanup

On close (client, timeout, or LiveKit departure) Sayna halts providers, stops recordings, leaves/deletes the room, flushes caches, and releases file handles.
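
The four phases above can be mirrored on the client as a small state machine that refuses to send anything out of order. This is a minimal sketch, not part of any Sayna SDK: the class name `SessionGuard`, its method names, and its state labels are hypothetical, and holding all traffic until `ready` arrives is a conservative assumption.

```javascript
// Hypothetical client-side guard for the handshake → config → active → cleanup phases.
class SessionGuard {
  constructor() {
    this.state = "connecting";
  }

  // Call from the socket's open handler.
  onOpen() {
    this.state = "awaiting_config";
  }

  // True if the outgoing control message is allowed in the current phase.
  canSend(message) {
    switch (this.state) {
      case "awaiting_config":
        return message.type === "config"; // config must come first
      case "active":
        return message.type !== "config"; // config is sent once per session
      default:
        return false; // connecting, awaiting_ready, closed
    }
  }

  // Call after a message has actually been written to the socket.
  markSent(message) {
    if (this.state === "awaiting_config" && message.type === "config") {
      this.state = "awaiting_ready";
    }
  }

  // Call with every parsed JSON frame from the server.
  onServerMessage(message) {
    if (this.state === "awaiting_ready" && message.type === "ready") {
      this.state = "active";
    }
  }

  // Call from the socket's close handler.
  onClose() {
    this.state = "closed";
  }
}
```

Wrap your send function so that it checks `canSend` first and calls `markSent` afterwards; anything rejected while waiting for `ready` can be queued and replayed once the guard becomes active.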

Message taxonomy

| Direction | Type | Purpose |
| --- | --- | --- |
| Client → Server | config | Declare STT/TTS/livekit setup; must come first. |
| Client → Server | Binary audio | Raw PCM/Opus chunks that match stt_config. |
| Client → Server | speak | Queue TTS synthesis (optionally override defaults per call). |
| Client → Server | clear | Flush TTS buffers (WebSocket + LiveKit) unless playback is locked. |
| Client → Server | loading_start | Begin looping the configured loading_audio clip on a dedicated loading-audio LiveKit track. |
| Client → Server | loading_stop | Stop the loading loop with a short fade-out. Idempotent and silent server-side. |
| Client → Server | send_message | Publish JSON payloads over the LiveKit data channel. |
| Client → Server | sip_transfer | Transfer the current SIP call to another phone number. |
| Client → Server | update_config | Dynamically adjust VAD and turn detection parameters. |
| Server → Client | ready | Providers + LiveKit are initialized; safe to stream. |
| Server → Client | stt_result | Interim/final transcripts with is_final / is_speech_final. |
| Server → Client | Binary audio | TTS playback chunks in the negotiated format. |
| Server → Client | message | Informational notices (LiveKit state, provider hints, etc.). |
| Server → Client | tts_playback_complete | Signals a speak request finished emitting audio. |
| Server → Client | participant_connected | LiveKit peer joined the room. |
| Server → Client | participant_disconnected | LiveKit peer left → Sayna closes the socket after 100 ms. |
| Server → Client | track_subscribed | Sayna subscribed to a participant’s media track. |
| Server → Client | vad_event | VAD activity change: speech start, silence, resume, or turn end. |
| Server → Client | error | Validation/provider issues; connection usually stays open so you can retry. |
| Server → Client | sip_transfer_error | SIP transfer operation failed. |
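
One way to consume the Server → Client rows is a handler table keyed by `type`. The sketch below assumes every JSON control frame carries a top-level `type` field, as the examples on this page do; `createRouter` and the shape of its return value are hypothetical names chosen for illustration.

```javascript
// Event types copied from the Server → Client rows of the taxonomy.
const SERVER_EVENT_TYPES = new Set([
  "ready", "stt_result", "message", "tts_playback_complete",
  "participant_connected", "participant_disconnected", "track_subscribed",
  "vad_event", "error", "sip_transfer_error",
]);

// Builds a function that parses one text frame and calls the matching handler.
function createRouter(handlers) {
  return function route(rawText) {
    const message = JSON.parse(rawText);
    const known = SERVER_EVENT_TYPES.has(message.type);
    const handler = known ? handlers[message.type] : undefined;
    if (handler) handler(message);
    // Report what happened so callers can log unknown or unhandled types.
    return { type: message.type, known, handled: Boolean(handler) };
  };
}
```

Logging frames where `known` is false makes it easier to notice event types added after this table was written.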

Configuration message

{
  "type": "config",
  "stream_id": "support-call-123",
  "audio": true,
  "stt_config": {
    "provider": "deepgram",
    "language": "en-US",
    "sample_rate": 16000,
    "channels": 1,
    "punctuation": true,
    "encoding": "linear16",
    "model": "nova-3",
    "auth": {
      "api_key": "your-deepgram-key"
    }
  },
  "tts_config": {
    "provider": "elevenlabs",
    "model": "eleven_flash_v2_5",
    "voice_id": "voice_id_here",
    "speaking_rate": 1.0,
    "audio_format": "linear16",
    "sample_rate": 24000,
    "auth": {
      "api_key": "your-elevenlabs-key"
    }
  },
  "livekit": {
    "room_name": "conversation-room-123",
    "enable_recording": true,
    "sayna_participant_identity": "sayna-ai",
    "sayna_participant_name": "Sayna AI",
    "listen_participants": []
  },
  "loading_audio": {
    "data": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...",
    "format": "wav",
    "volume": 0.3
  }
}
  • audio=false skips STT/TTS initialization but still allows LiveKit control + data channel relays.
  • stream_id is optional; provide one to control recording paths and correlate sessions, or let Sayna generate a UUID v4 and return it in the ready message.
  • stt_config.sample_rate/channels/encoding must exactly match the binary frames you send—Sayna will not resample.
  • tts_config values become session defaults; individual speak messages can override most fields or add id for correlation.
  • stt_config.auth and tts_config.auth are optional objects that supply provider credentials for this session only. When provided, they fully replace server-configured credentials (no partial merge). Omit or send {} to use server defaults. See Provider auth overrides below for supported shapes per provider.
  • The livekit block is optional but required for send_message and mirrored playback/recording.
  • loading_audio is optional and additive; it is only processed when audio=true and the livekit block is present, and a decode failure is non-fatal (the session continues without loading capability). See the dedicated Loading indicator section below.
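
The rules above can be folded into a small builder that assembles the `config` message and catches the most common mistake before it reaches the server. This is a sketch under stated assumptions: `buildConfigMessage` and its camelCase option names are hypothetical, and silently dropping `loading_audio` when it cannot be processed is a client-side choice, not server behavior.

```javascript
// Hypothetical helper that assembles the first message of a session.
function buildConfigMessage({ audio = true, streamId, sttConfig, ttsConfig, livekit, loadingAudio } = {}) {
  if (audio && (!sttConfig || !ttsConfig)) {
    throw new Error("audio=true requires both stt_config and tts_config");
  }
  const message = { type: "config", audio };
  if (streamId) message.stream_id = streamId; // optional; the server generates one otherwise
  if (audio) {
    message.stt_config = sttConfig;
    message.tts_config = ttsConfig;
  }
  if (livekit) message.livekit = livekit;
  // loading_audio is only processed when audio=true and a livekit block is present.
  if (loadingAudio && audio && livekit) message.loading_audio = loadingAudio;
  return message;
}
```

For a control-only session, `buildConfigMessage({ audio: false, livekit: { room_name: "conversation-room-123" } })` yields a message with no STT/TTS blocks.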

Streaming audio best practices

  • Chunk size: ~100–200 ms of audio per binary frame (e.g., 3,200 bytes for 100 ms of 16 kHz mono linear16).
  • Maintain consistent cadence; bursty uploads increase STT latency.
  • Enable punctuation + turn-detect (default features) when you need is_speech_final cues to know when a speaker stopped.
  • If the client detects a barge-in (human interrupts AI), send {"type":"clear"} before queueing new speak text.
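
The chunk-size guidance reduces to simple arithmetic: bytes per frame = samples per frame × channels × bytes per sample. The sketch below assumes 16-bit samples (linear16) and a `Uint8Array` of captured PCM; `bytesPerChunk` and `chunkPcm` are hypothetical helper names.

```javascript
// Size in bytes of one binary frame holding chunkMs of audio.
function bytesPerChunk(sampleRate, channels, chunkMs, bytesPerSample = 2) {
  const samplesPerChannel = Math.round((sampleRate * chunkMs) / 1000);
  return samplesPerChannel * channels * bytesPerSample;
}

// Splits a PCM buffer into frame-sized views without copying the data.
function* chunkPcm(pcm, chunkBytes) {
  for (let offset = 0; offset < pcm.byteLength; offset += chunkBytes) {
    yield pcm.subarray(offset, Math.min(offset + chunkBytes, pcm.byteLength));
  }
}
```

To keep a steady cadence, send one chunk per timer tick (for example every 100 ms for 100 ms chunks) rather than writing the whole buffer in a single loop.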

Client commands in detail

speak

{
  "type": "speak",
  "text": "Hello! How can I help you today?",
  "flush": true,
  "allow_interruption": true,
  "tts_config": {"voice_id": "aura-luna-en"},
  "id": "greeting-1"
}
  • flush=true clears queued audio before speaking; set false to build playlists.
  • allow_interruption=false protects critical prompts even if the client sends clear.
  • tts_config may override any session default, including provider credentials via the auth field.
  • Finish signal arrives via tts_playback_complete with the same id.
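
Because completion is reported by `id`, a client can keep a map of in-flight speak requests and clear entries as `tts_playback_complete` events arrive. This sketch assumes the event exposes the identifier in a top-level `id` field; `SpeakTracker` and its generated `speak-N` ids are hypothetical.

```javascript
// Hypothetical tracker that correlates speak requests with their completion events.
class SpeakTracker {
  constructor() {
    this.pending = new Map(); // id -> text still being played
    this.nextId = 1;
  }

  // Builds a speak message, generating an id when the caller did not supply one.
  speak(text, options = {}) {
    const id = options.id ?? `speak-${this.nextId++}`;
    this.pending.set(id, text);
    return { type: "speak", text, ...options, id };
  }

  // Returns the text that finished playing, or undefined for an unknown id.
  onPlaybackComplete(message) {
    const text = this.pending.get(message.id);
    this.pending.delete(message.id);
    return text;
  }
}
```

Serialize the object returned by `speak` with `JSON.stringify` before writing it to the socket.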

loading_start

{ "type": "loading_start" }
Begins the loading-indicator audio loop on the dedicated "loading-audio" LiveKit track. Fire-and-forget; idempotent if the loop is already running. Requires audio=true, an active LiveKit room, and a loading_audio clip supplied in the initial config message. Failures (no config yet, audio disabled, no LiveKit room, missing or undecodable loading_audio, track publish failure) surface as error messages on the existing error channel.

loading_stop

{ "type": "loading_stop" }
Stops the loading loop with a short fade-out (~30 ms). Fire-and-forget, idempotent, and always silent server-side — stopping when nothing is playing or when no LiveKit room exists is a harmless no-op, never an error. Because speak does not stop the loop, applications must send loading_stop before (or together with) speak in the normal “thinking finished, now answer” flow. Otherwise the loading sound continues playing under the spoken answer.

send_message

{
  "type": "send_message",
  "message": "{\"action\":\"typing\"}",
  "role": "assistant",
  "topic": "status",
  "debug": {"source": "sayna-ai", "version": "1.0"}
}
Used when Sayna should notify LiveKit participants via data channels (chat bubbles, control events). Requires a livekit block in the configuration message.
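
Note that `message` in the example is a string containing JSON, not a nested object. A small builder can do that encoding for you. In this sketch `buildSendMessage` is a hypothetical helper, and defaulting `role` to "assistant" simply mirrors the example above rather than a documented default.

```javascript
// Hypothetical builder for send_message frames.
function buildSendMessage(payload, { role = "assistant", topic, debug } = {}) {
  const frame = {
    type: "send_message",
    // Objects are serialized; strings are passed through unchanged.
    message: typeof payload === "string" ? payload : JSON.stringify(payload),
    role,
  };
  if (topic !== undefined) frame.topic = topic;
  if (debug !== undefined) frame.debug = debug;
  return frame;
}
```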

Server events in detail

ready

Echoes the session stream_id (provided or auto-generated), LiveKit metadata (room name, URLs, participant identity), and signals that STT/TTS providers finished booting. Use that stream_id later with GET /recording/{stream_id} when recording is enabled. Other participants still need to call POST /livekit/token to join the room.

stt_result

{
  "type": "stt_result",
  "transcript": "Hello, how are you today?",
  "is_final": true,
  "is_speech_final": true,
  "confidence": 0.94
}
  • Interim updates set is_final=false and keep streaming as long as audio flows.
  • The final chunk for an utterance flips is_final=true.
  • is_speech_final=true indicates the turn detector heard silence—ideal for triggering a response.
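
A common pattern is to collect final segments until the turn detector fires, then hand the joined text to your application. The sketch below assumes `is_speech_final=true` arrives on a result that also has `is_final=true`; `TurnAccumulator` is a hypothetical name.

```javascript
// Hypothetical accumulator that turns a stream of stt_result events into whole turns.
class TurnAccumulator {
  constructor() {
    this.finalSegments = []; // finalized text for the turn in progress
    this.interim = "";       // latest interim hypothesis, useful for live captions
  }

  // Returns the full turn text when the speaker has stopped, otherwise null.
  push(result) {
    if (result.is_final) {
      if (result.transcript) this.finalSegments.push(result.transcript);
      this.interim = "";
    } else {
      this.interim = result.transcript;
    }
    if (result.is_speech_final) {
      const turn = this.finalSegments.join(" ");
      this.finalSegments = [];
      return turn;
    }
    return null;
  }
}
```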

message and participant_disconnected

Utility events for LiveKit mirroring:
{"type":"message","label":"livekit.joined","details":{"identity":"user-123"}}
{"type":"participant_disconnected","participant":{"identity":"user-123","name":"Alex"}}
When a participant_disconnected event arrives, expect Sayna to close the WebSocket 100 ms later so the LiveKit room is torn down cleanly.

Binary TTS payloads

Binary frames deliver synthesized audio in the format you declared (e.g., 24 kHz linear16). Handle them alongside JSON control frames:
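socket.onmessage = (event) => {
  // Binary frames carry TTS audio: a Blob by default, an ArrayBuffer if socket.binaryType = "arraybuffer".
  if (event.data instanceof Blob || event.data instanceof ArrayBuffer) playAudioChunk(event.data);
  // Text frames carry JSON control messages.
  else handleControl(JSON.parse(event.data));
};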

Provider auth overrides

Both stt_config and tts_config accept an optional auth object that lets clients supply provider credentials per WebSocket session or per POST /speak request. This is useful for multi-tenant deployments, customer-managed provider keys, and per-request credential routing.

Fallback behavior

  • If auth is omitted, the server uses its configured provider credentials.
  • If auth is sent as an empty object {}, the server also falls back to its configured credentials.

No partial merge

If auth is provided, it must be complete for that provider. The server does not merge a partial auth object with server-side credentials. For example, sending Azure auth with only api_key and no region is invalid.

Supported auth shapes

| Provider | Shape | Required fields |
| --- | --- | --- |
| Deepgram | { "api_key": "..." } | api_key |
| ElevenLabs | { "api_key": "..." } | api_key |
| Cartesia | { "api_key": "..." } | api_key |
| Google Cloud | { "credentials": "..." } | credentials (file path string or JSON service account object) |
| Azure Speech | { "api_key": "...", "region": "..." } | api_key, region |

Validation

Clients should expect session or request initialization to fail if:
  • Required auth fields for the selected provider are missing.
  • Auth field types are invalid.
  • Unsupported auth keys are sent for that provider.
  • The provider name is unsupported.
The provider value in stt_config.provider or tts_config.provider determines how the auth object is interpreted — the auth object itself does not include a provider discriminator.
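
Checking an auth object on the client before sending it avoids a failed session initialization. The sketch below encodes the required fields from the table above. The provider keys `cartesia`, `google`, and `azure` are assumed identifiers (only `deepgram` and `elevenlabs` appear in this page's examples), `checkAuth` is a hypothetical helper, and field types are not checked.

```javascript
// Required auth fields per provider. Keys other than deepgram/elevenlabs are assumed names.
const AUTH_FIELDS = {
  deepgram: ["api_key"],
  elevenlabs: ["api_key"],
  cartesia: ["api_key"],
  google: ["credentials"],
  azure: ["api_key", "region"],
};

// Returns a list of problems; an empty list means the auth object looks acceptable.
function checkAuth(provider, auth) {
  const required = AUTH_FIELDS[provider];
  if (!required) return [`unsupported provider ${provider}`];
  // Omitted or empty auth falls back to server-configured credentials.
  if (auth === undefined || Object.keys(auth).length === 0) return [];
  const problems = [];
  for (const field of required) {
    if (!(field in auth)) problems.push(`missing ${field}`);
  }
  for (const key of Object.keys(auth)) {
    if (!required.includes(key)) problems.push(`unsupported key ${key}`);
  }
  return problems;
}
```

For example, `checkAuth("azure", { api_key: "..." })` reports the missing `region`, which matches the no-partial-merge rule.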

Loading indicator

The loading indicator is a short audio clip looped into the LiveKit room while your application is busy (“thinking”) and cannot yet answer. It is the audio equivalent of a spinner. The clip plays on its own dedicated LiveKit audio track named "loading-audio", separate from the "tts-audio" speech track, so it never interferes with text-to-speech output.

loading_audio configuration

Include a loading_audio object in the initial config message. The server decodes and validates the clip once at config time and holds the validated buffer for the session.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| data | string | Yes | Base64-encoded audio bytes: a complete WAV file or raw 16-bit little-endian PCM. |
| format | "wav" \| "pcm" | No | Container hint. If omitted, the server auto-detects from the RIFF/WAVE signature. |
| sample_rate | integer | Conditional | Sample rate in Hz (8000–48000). Required for raw PCM; ignored for WAV (the header is authoritative). |
| channels | integer | No | 1 (mono) or 2 (stereo). Defaults to 1. Ignored for WAV. |
| volume | number | No | Playback volume 0.0–1.0. Default 1.0; out-of-range values are clamped; applied once at config time. |
Audio requirements
  • Only 16-bit PCM is supported (16-bit integer WAV, or raw 16-bit little-endian PCM). Other bit depths are rejected.
  • Decoded duration: ~250 ms to ~10 s, 1–4 s recommended.
  • Decoded size is capped at ~1.9 MB.
  • Author the clip so its start and end amplitudes match — otherwise the loop seam will click.
  • Author the clip quiet enough to sit under speech rather than compete with it.
Behavioral rules
  • loading_audio is processed only when audio=true and a livekit block is present in the same config message. Supplied without those, the server emits an error and continues the session without loading capability.
  • Decode or validation failures are non-fatal — the session continues, STT/TTS are unaffected, and only the loading feature is unavailable. A later loading_start re-reports the original decode error so the failure is never silent.
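
Since a bad clip only surfaces later as an error on `loading_start`, it can help to validate raw PCM before building the `loading_audio` object. This sketch targets Node.js (it uses `Buffer` for base64) and checks only sample rate, channel count, and duration; the ~1.9 MB size cap is left to the server because its exact value is not stated here. `buildLoadingAudio` is a hypothetical helper.

```javascript
// Hypothetical builder for a loading_audio object from raw 16-bit little-endian PCM.
function buildLoadingAudio(pcmBytes, { sampleRate, channels = 1, volume = 1.0 }) {
  if (!Number.isInteger(sampleRate) || sampleRate < 8000 || sampleRate > 48000) {
    throw new RangeError("sample_rate must be an integer between 8000 and 48000");
  }
  if (channels !== 1 && channels !== 2) {
    throw new RangeError("channels must be 1 or 2");
  }
  // 2 bytes per sample per channel for 16-bit PCM.
  const durationMs = (pcmBytes.length / (2 * channels) / sampleRate) * 1000;
  if (durationMs < 250 || durationMs > 10000) {
    throw new RangeError(`clip is ${Math.round(durationMs)} ms; expected roughly 250 ms to 10 s`);
  }
  return {
    data: Buffer.from(pcmBytes).toString("base64"),
    format: "pcm",
    sample_rate: sampleRate,
    channels,
    volume: Math.min(1, Math.max(0, volume)), // mirror the server-side clamp
  };
}
```

A one-second mono clip at 16 kHz is 32,000 bytes, so `buildLoadingAudio(clip, { sampleRate: 16000, volume: 0.3 })` passes validation and can be assigned to `loading_audio` in the config message.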

Controlling the loop

The loop is controlled exclusively by the loading_start and loading_stop commands. speak and clear do not affect it. Server-side, the loop also stops automatically on session teardown / disconnect. The server does not stop the loop when VAD detects user speech. If you want the loading audio to end on barge-in, send loading_stop from your vad_event or stt_result handler.

Typical turn

server → stt_result (final, is_speech_final=true)
client → {"type": "loading_start"}      # background work begins
…application calls its LLM/tools…
client → {"type": "loading_stop"}        # background work ends — stop BEFORE speak
client → {"type": "speak", "text": "Here is what I found…"}
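
The ordering in this turn can be enforced in code so that `speak` is never sent while the loop is still running. In the sketch below, `LoadingTurn` is a hypothetical wrapper and `send` is whatever function writes a JSON control frame to your socket. Sending `loading_stop` unconditionally is safe because the command is idempotent.

```javascript
// Hypothetical wrapper that keeps loading_stop ahead of speak.
class LoadingTurn {
  constructor(send) {
    this.send = send;     // e.g. (frame) => socket.send(JSON.stringify(frame))
    this.loading = false; // local view of whether the loop was started
  }

  // Background work begins.
  startThinking() {
    if (!this.loading) {
      this.send({ type: "loading_start" });
      this.loading = true;
    }
  }

  // Background work ends: stop the loop BEFORE queueing the answer.
  answer(text) {
    this.send({ type: "loading_stop" });
    this.loading = false;
    this.send({ type: "speak", text });
  }

  // User interrupted: stop the loop and flush queued TTS audio.
  bargeIn() {
    this.send({ type: "loading_stop" });
    this.loading = false;
    this.send({ type: "clear" });
  }
}
```

Call `bargeIn()` from your `vad_event` or `stt_result` handler if you want the loading audio to end when the user starts talking, since the server does not do this on its own.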

LiveKit interplay

  • Sayna manages the agent-side participant automatically; humans still fetch tokens via POST /livekit/token.
  • listen_participants limits which identities can feed audio back into Sayna.
  • Recordings respect your LiveKit/S3 configuration when enable_recording=true; files are stored at {server_prefix}/{stream_id}/audio.ogg using the stream_id from your config or the ready message.
  • send_message and clear affect both the WebSocket client and LiveKit room so playback remains in sync.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| error: STT and TTS configurations required when audio is enabled | audio=true but one of the configs missing. | Provide both stt_config and tts_config, or set audio=false. |
| WebSocket closes right after participant_disconnected | LiveKit peer left. | Reconnect / reconfigure to start a new session. |
| Garbled transcripts | Audio frame format doesn’t match stt_config. | Ensure sample rate, channels, and encoding align with captured audio. |
| LiveKit data messages rejected | send_message used without livekit section. | Include LiveKit config in the initial config payload. |
| Loading audio plays under the spoken answer | speak was sent before loading_stop. | Send loading_stop immediately before (or together with) speak. |
Need more detail on SIP routing or carrier setup? Pair this guide with the SIP configuration and Twilio SIP setup references.