# ask-ai
a
Why is the text not sent immediately to TTS?
t
Thanks for reaching out, Aditya. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Text is not sent immediately to TTS (Text-To-Speech) in LiveKit systems for several intentional design reasons:

- **Tokenization and buffering:** The system often waits to accumulate a minimum number of words (e.g., 10 words in some implementations) before processing and sending text to TTS. This buffering enables more natural speech synthesis, especially when the TTS provider does not support native streaming.
- **Segment-based, synchronized processing:** Text is split into larger segments (sentences or phrases) rather than sent character by character. Once a segment is ready, a signal marks its boundary and initiates TTS processing, aligning spoken audio with text transcripts.
- **Latency and audio-quality optimization:** Some delay is introduced to synchronize text and speech, improving timing and output quality. Extra steps such as audio frame processing and network connection management can also add delay.
- **Provider-specific behaviors:** Certain TTS models have inherent delays and timeouts tied to their processing speeds and API characteristics.

These mechanisms intentionally balance responsiveness, output quality, and system reliability for real-time voice agents. Let me know if you need details on a specific TTS provider or further configuration options.

Sources: livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/tts.py | AI voice agents | LiveKit Docs | Live Conversations with AI using ChatGPT and WebRTC | agents/src/pipeline/agent_output.ts
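The word-buffering idea in the first bullet can be sketched in a few lines of plain Python. This is a toy illustration, not LiveKit's actual implementation; `buffer_words`, `MIN_WORDS`, and `flush` are made-up names for the sketch:

```python
# Toy sketch of word buffering: accumulate words until a threshold is
# reached, then flush the whole segment to the TTS backend in one call.
MIN_WORDS = 10  # assumed threshold; real plugins use a similar constant

def buffer_words(token_stream, flush, min_words=MIN_WORDS):
    """Collect incoming word tokens and flush them in segments."""
    buf = []
    for word in token_stream:
        buf.append(word)
        if len(buf) >= min_words:
            flush(" ".join(buf))  # one TTS request per segment
            buf = []
    if buf:  # flush whatever is left at end of stream
        flush(" ".join(buf))

segments = []
words = "one two three four five six seven eight nine ten eleven".split()
buffer_words(iter(words), segments.append, min_words=5)
```

With a threshold of 5, the eleven-word stream above is flushed as three segments: two full five-word segments and a final one-word remainder.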
a
How do I change this minimum buffer size?
t
Hey there! Want to fine-tune how your text-to-speech sounds? You can adjust the minimum number of buffered words before text is sent to TTS by passing a custom tokenizer when you set up the TTS service.

**Python:** Create a custom tokenizer with your preferred buffer value:

```python
from livekit.agents import tokenize
from livekit.plugins.google import tts

# For Google TTS (example):
custom_tokenizer = tokenize.basic.SentenceTokenizer(min_sentence_len=5)
tts_service = tts.TTS(tokenizer=custom_tokenizer)
```
**JavaScript:** Set your custom buffer when initializing the tokenizer:

```javascript
// Replace YOUR_CUSTOM_COUNT with your chosen value
#tokenizer = new tokenize.basic.SentenceTokenizer(undefined, YOUR_CUSTOM_COUNT);
```
**Pro tip:** For a global change, you can modify the `BUFFERED_WORDS_COUNT` constant in the TTS plugin's source code, which sets the buffer size for that provider everywhere. Just a heads up: buffer size is all about finding the sweet spot. A lower value means less delay but can sound a bit choppy, while a higher value gives smoother, more natural speech at the cost of a bit more waiting time. Curious about specifics for your TTS provider or programming language? Just ask and I'll help you pinpoint the exact file, function, or parameter you need!

Sources: livekit-plugins/livekit-plugins-elevenlabs/livekit/plugins/elevenlabs/tts.p... | agents/src/tokenize/token_stream.ts
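The latency/smoothness tradeoff above can be made concrete with a toy calculation (plain Python, not LiveKit code; `segment_count` is a made-up helper). For the same text, a smaller buffer means more TTS round trips that each start sooner; a larger buffer means fewer, bigger requests:

```python
# Toy illustration: how many TTS segments the same text produces
# at different buffered-word counts.
def segment_count(text, buffered_words):
    words = text.split()
    # ceil division: each segment holds up to `buffered_words` words
    return -(-len(words) // buffered_words)

text = "The quick brown fox jumps over the lazy dog near the river bank today"
low = segment_count(text, 3)    # choppier, but each segment starts sooner
high = segment_count(text, 10)  # smoother, but more up-front delay
```

The 14-word sentence yields 5 segments with a buffer of 3 words, but only 2 segments with a buffer of 10.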
a
what is stream context length?
t
Hey there! "Stream context length" isn't a standard term in LiveKit's TTS docs, but there are several related settings that control how audio streaming works:

- **Streaming latency:** A small, configurable delay before audio starts streaming that helps smooth out delivery.
- **Chunk length schedule:** How large the chunks are when streaming; developers typically use values like 80, 120, or 260.
- **Buffered word count:** How many words get collected before being turned into speech, usually around 8-10 words, which helps the audio sound more natural.
- **Context parameters:** Some TTS providers let you include previous text so the speech sounds smoother and more conversational.

If you've got a specific TTS provider or integration in mind, like ElevenLabs, Hume AI, or Cartesia, just let me know! I'd be happy to dive deeper into the specific parameters and how they impact speech synthesis. 😊

Sources: ElevenLabs integration guide | LiveKit Docs | livekit-agents/livekit/agents/tts/tts.py | agents/src/tts/tts.ts
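The "chunk length schedule" idea above can be illustrated with a toy splitter. This is an assumption about the general mechanism (emit progressively larger chunks, repeating the final size), not any provider's exact algorithm; `chunks_by_schedule` is a made-up helper:

```python
# Toy sketch: split a text stream into chunks whose sizes follow a
# schedule like [80, 120, 260]; once the schedule is exhausted, the
# last size repeats for the rest of the stream.
def chunks_by_schedule(text, schedule=(80, 120, 260)):
    out, i, step = [], 0, 0
    while i < len(text):
        size = schedule[min(step, len(schedule) - 1)]
        out.append(text[i:i + size])
        i += size
        step += 1
    return out

pieces = chunks_by_schedule("x" * 500)
sizes = [len(p) for p in pieces]
```

Small early chunks let playback start quickly; later, larger chunks reduce per-request overhead once the stream is underway.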
a
How can I change it?
t
Hey there! Let me break down how you can fine-tune streaming latency and chunk length for TTS in LiveKit. Think of it like adjusting the rhythm of your audio conversation.

**For ElevenLabs TTS** (they've got some pretty advanced settings):
```python
from livekit.plugins.elevenlabs import tts

eleven_tts = tts.TTS(
    streaming_latency=3,                        # latency optimization level (provider-specific, not seconds)
    chunk_length_schedule=[80, 120, 200, 260],  # chunk sizes
    auto_mode=True,                             # enables lowest latency with full sentences
    # ...other options
)
```
**Pro tip:** Lower `streaming_latency` or smaller chunk sizes give you faster audio, but keep in mind the speech may sound a bit less smooth.

**For Hume AI TTS:**
```python
hume_tts = tts.TTS(
    instant_mode=True,  # enables ultra-low latency streaming
    # ...other options
)
```
**For Google Cloud TTS:**

```python
google_tts = tts.TTS(
    use_streaming=True,  # enables streaming synthesis
    # ...
)
```
Other providers have their own knobs, like Rime's `reduce_latency=True` or Inworld's sample-rate and bit-rate tweaks. Each one's a bit unique! Curious about the specifics for your particular setup? Just tell me your provider or programming language, and I'll help you nail down the perfect configuration. 😊

Sources: TTSOptions | LiveKit Agents | Hume TTS integration guide | LiveKit Docs | ElevenLabs integration guide | LiveKit Docs | livekit-plugins/livekit-plugins-hume/livekit/plugins/hume/tts.py