LiveKit does not send each tokenized text unit to the TTS engine as soon as it arrives. Instead, it buffers the incoming tokens and segments them (by default, on sentence boundaries) to produce more natural, smoother speech output. In practice, the pipeline waits for enough LLM output to complete a meaningful segment, such as a full sentence, before pushing that batch to TTS for synthesis. This buffering is typically governed by a configurable minimum word count (for example, 8 words) or by sentence boundaries.
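As a rough sketch of how that segmentation can be configured (assuming the Python livekit-agents package; constructor arguments such as `min_sentence_len` and `sentence_tokenizer` may differ between versions), a non-streaming TTS can be wrapped in a `StreamAdapter` that buffers text into sentences before synthesis:

```python
from livekit.agents import tts, tokenize
from livekit.plugins import openai  # example plugin; any TTS plugin works here

# Sentence tokenizer that decides when a buffered segment is "complete enough".
# min_sentence_len is an assumption about the parameter name: it sets how much
# text must accumulate before a sentence is emitted for synthesis.
sentence_tok = tokenize.basic.SentenceTokenizer(min_sentence_len=20)

# StreamAdapter buffers incoming text and only forwards whole sentences
# to the underlying (non-streaming) TTS engine.
buffered_tts = tts.StreamAdapter(
    tts=openai.TTS(),
    sentence_tokenizer=sentence_tok,
)
```

You would then pass `buffered_tts` wherever the agent expects its `tts=` option when constructing the voice agent or session.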
However, if you need lower-latency TTS output, LiveKit supports streaming synthesis, where audio segments are processed and sent as they are generated. You can control this behavior by choosing the tokenizer (word-based or sentence-based) and by using flush operations, which push whatever text is buffered to the TTS engine even if a complete sentence has not yet formed.
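For the lower-latency path, here is a minimal sketch of driving a streaming TTS directly, assuming a streaming-capable plugin (`livekit.plugins.cartesia` is used purely as an example) and the `SynthesizeStream` interface with `push_text`, `flush`, and `end_input`; `handle_frame` is a hypothetical placeholder for whatever you do with the resulting audio frames:

```python
import asyncio

from livekit.plugins import cartesia  # example plugin; any streaming TTS works


async def speak_tokens(tokens: list[str]) -> None:
    engine = cartesia.TTS()
    stream = engine.stream()  # a SynthesizeStream

    async def producer() -> None:
        for tok in tokens:
            stream.push_text(tok)  # tokens arrive incrementally (e.g. from the LLM)
        stream.flush()             # synthesize whatever is buffered, sentence or not
        stream.end_input()         # signal that no more text is coming

    async def consumer() -> None:
        async for ev in stream:     # SynthesizedAudio events as audio is produced
            handle_frame(ev.frame)  # hypothetical: publish the frame to an audio track

    await asyncio.gather(producer(), consumer())
```

Calling `flush()` before a sentence is complete is what trades some naturalness for lower latency; in a full agent pipeline, the framework drives this stream for you.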
The overall flow is configurable, allowing you to strike a balance between low latency and natural-sounding speech output depending on your requirements.
If you have a specific TTS provider, use case, or configuration in mind, I can provide more targeted guidance.
Sources:
Live Conversations with AI using ChatGPT and WebRTC
livekit-agents/livekit/agents/voice/agent_activity.py
agents/src/tts/tts.ts
AI voice agents | LiveKit Docs