brash-barista-66564
07/25/2025, 11:21 PM
• Should we create one segment per synthesis and stream multiple audio chunks within it?
• Or should each audio chunk be its own segment?
2. Proper Streaming Pattern: For a custom TTS that receives audio data in chunks from an HTTP stream, what's the recommended approach?
# Is this correct?
output_emitter.start_segment(segment_id="synthesis-1")
async for audio_chunk in tts_response_stream:
    output_emitter.push(audio_chunk)  # Multiple chunks, one segment?
output_emitter.end_segment()
3. Flush Timing: We were waiting for the _FlushSentinel before starting TTS processing, which caused our 19s delay. Should we:
• Start TTS processing immediately when we have enough text (50+ chars)?
• Keep the same segment open across multiple text inputs until flush?
• Create new segments for each text input?
Current Working Solution:
We now use early processing (start TTS after 50+ chars) with one segment that stays open until flush (rough sketch below), but we want to confirm that this follows LiveKit best practices and to understand how to achieve lower latency and a streaming effect.
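Roughly, the text-buffering side of what we do looks like the sketch below. This is simplified and hypothetical: MIN_CHARS, the None stand-in for the flush sentinel, and the synthesize callback are our own placeholders, not LiveKit APIs.

from typing import AsyncIterator, Awaitable, Callable, Optional

MIN_CHARS = 50  # heuristic: start synthesis once this much text has accumulated

async def stream_one_segment(
    text_inputs: AsyncIterator[Optional[str]],     # None stands in for the flush sentinel
    synthesize: Callable[[str], Awaitable[None]],  # pushes text into the currently open segment
) -> None:
    """Start TTS early (after MIN_CHARS) and keep one segment open until flush."""
    pending = ""
    started = False
    async for chunk in text_inputs:
        if chunk is None:  # flush: this segment is done
            break
        pending += chunk
        if not started and len(pending) < MIN_CHARS:
            continue  # not enough text yet to justify starting synthesis
        await synthesize(pending)  # send accumulated text into the same open segment
        pending = ""
        started = True
    if pending:  # synthesize whatever remains when the flush arrives
        await synthesize(pending)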
Question: What’s the official recommended pattern for streaming custom TTS providers with proper segment/chunk management in LiveKit Agents?
Any code examples or documentation on TTS streaming patterns would be greatly appreciated!
Thanks!
tall-airline-16756
07/25/2025, 11:21 PM
tall-airline-16756
07/25/2025, 11:22 PM
output_emitter.start_segment(segment_id="synthesis-1")
async for audio_chunk in tts_response_stream:
    output_emitter.push(audio_chunk)  # Multiple chunks, one segment
output_emitter.end_segment()
• 3. Flush Timing:
◦ Start TTS processing as soon as you have enough text (e.g., 50+ characters); see the sketch after the code example below.
◦ Keep the same segment open across multiple text inputs until a flush is required.
◦ Create new segments only when logically necessary (e.g., for distinct synthesis requests).
• 4. Additional Best Practices:
◦ Use the AudioEmitter class to manage audio frames and segments efficiently.
◦ Ensure proper tokenization of input text to handle streaming effectively.
◦ Handle errors and cleanup resources properly to avoid latency issues.
• *Code Example:* Here's a complete example of the streaming implementation:
from livekit.agents import tts, utils

class SynthesizeStream(tts.SynthesizeStream):
    async def _run(self, output_emitter: tts.AudioEmitter) -> None:
        # Initialize the emitter once per request with the stream's audio format.
        # self._sample_rate and self._synthesize_stream() are provided by the
        # surrounding custom TTS implementation.
        output_emitter.initialize(
            request_id=utils.shortuuid(),
            sample_rate=self._sample_rate,
            num_channels=1,
            mime_type="audio/pcm",
            stream=True,
        )

        # Start a single segment for this synthesis
        output_emitter.start_segment(segment_id=utils.shortuuid())

        # Push every audio chunk from the provider into the same open segment
        async for audio_chunk in self._synthesize_stream():
            output_emitter.push(audio_chunk)

        # Close the segment once the provider stream is exhausted
        output_emitter.end_segment()
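For the flush-timing point above, here is a rough sketch of how an early start (50+ characters) and a single open segment per flush might be combined inside _run. It assumes the base class's self._input_ch yields text chunks and self._FlushSentinel instances (as referenced in the question); MIN_CHARS, self._sample_rate, and self._send_to_provider() are illustrative placeholders for your own TTS plumbing, not LiveKit APIs:

from livekit.agents import tts, utils

MIN_CHARS = 50  # illustrative threshold before contacting the TTS provider

class EarlyStartStream(tts.SynthesizeStream):
    async def _run(self, output_emitter: tts.AudioEmitter) -> None:
        output_emitter.initialize(
            request_id=utils.shortuuid(),
            sample_rate=self._sample_rate,  # provided by the surrounding TTS implementation
            num_channels=1,
            mime_type="audio/pcm",
            stream=True,
        )
        output_emitter.start_segment(segment_id=utils.shortuuid())

        buffer = ""
        async for item in self._input_ch:
            if isinstance(item, self._FlushSentinel):
                break  # a flush marks the end of this segment
            buffer += item
            if len(buffer) < MIN_CHARS:
                continue  # keep buffering until there is enough text to start
            async for audio_chunk in self._send_to_provider(buffer):  # hypothetical helper
                output_emitter.push(audio_chunk)
            buffer = ""

        if buffer:  # synthesize whatever is left when the flush arrives
            async for audio_chunk in self._send_to_provider(buffer):
                output_emitter.push(audio_chunk)

        output_emitter.end_segment()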
This approach ensures proper audio frame timing, segment boundaries, and compatibility with LiveKit's audio processing pipeline. Let me know if you need further clarification or additional examples!
Sources: livekit-plugins/livekit-plugins-google/livekit/plugins/google/tts.py