# ask-ai
How can I stream Baseten TTS to the room with the Baseten plugin or a custom TTS? We're implementing a custom TTS provider (Baseten) with the LiveKit Agents framework and running into segment-management issues that are causing significant latency.

**Our problem:**
- Our custom TTS was taking 19+ seconds instead of the expected ~700 ms.
- We were getting "expected 1 segment, got 8" errors because we tried several approaches to create a streaming effect and get the audio chunks out faster.
- We had to restructure so that a single segment streams multiple audio chunks within it.

**Current implementation questions:**

1. **Segment vs. chunk confusion:** What's the correct pattern for streaming TTS audio? Should we create one `segment` per synthesis and stream multiple audio `chunks` within it, or should each audio chunk be its own `segment`?
2. **Proper streaming pattern:** For a custom TTS that receives audio data in chunks from an HTTP stream, what's the recommended approach?
```python
# Is this correct?
output_emitter.start_segment(segment_id="synthesis-1")
async for audio_chunk in tts_response_stream:
    output_emitter.push(audio_chunk)  # Multiple chunks, one segment?
output_emitter.end_segment()
```
3. **Flush timing:** We were waiting for the `_FlushSentinel` before starting TTS processing, which caused our 19-second delay. Should we:
   - start TTS processing immediately once we have enough text (50+ characters)?
   - keep the same segment open across multiple text inputs until a flush?
   - create a new segment for each text input?

**Current working solution:** We now use early processing (start TTS after 50+ characters) with one segment that stays open until flush, but we want to confirm this follows LiveKit best practices and learn how to achieve lower latency and a streaming effect.

**Question:** What's the officially recommended pattern for streaming custom TTS providers with proper segment/chunk management in LiveKit Agents? Any code examples or documentation on TTS streaming patterns would be greatly appreciated. Thanks!
Thanks for reaching out, Andrew. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
To implement a custom TTS provider like Baseten using LiveKit Agents, here is the recommended pattern for proper segment and chunk management:

**1. Segment vs. chunk management:**
- Create one segment per synthesis and stream multiple audio chunks within it.
- Avoid creating a new segment for each audio chunk; push multiple chunks into the same segment instead.

**2. Proper streaming pattern:** Here's the correct implementation pattern for streaming TTS audio:
```python
output_emitter.start_segment(segment_id="synthesis-1")

async for audio_chunk in tts_response_stream:
    output_emitter.push(audio_chunk)  # Multiple chunks, one segment

output_emitter.end_segment()
```
**3. Flush timing:**
- Start TTS processing as soon as you have enough text (e.g., 50+ characters).
- Keep the same segment open across multiple text inputs until a flush is required.
- Create new segments only when logically necessary (e.g., for distinct synthesis requests). A sketch of the input-handling side follows the complete example below.

**4. Additional best practices:**
- Use the `AudioEmitter` class to manage audio frames and segments efficiently.
- Ensure proper tokenization of the input text so it can be handled as a stream.
- Handle errors and clean up resources properly to avoid latency issues.

**Code example:** Here's a complete example of the streaming implementation:
```python
from livekit.agents import tts, utils


class SynthesizeStream(tts.SynthesizeStream):
    async def _run(self, output_emitter: tts.AudioEmitter) -> None:
        output_emitter.initialize(
            request_id=utils.shortuuid(),
            sample_rate=self._sample_rate,
            num_channels=1,
            mime_type="audio/pcm",
            stream=True,
        )

        # Start segment
        output_emitter.start_segment(segment_id=utils.shortuuid())

        # Push audio chunks received from the provider's HTTP stream
        # (self._synthesize_stream() is the provider-specific streaming helper)
        async for audio_chunk in self._synthesize_stream():
            output_emitter.push(audio_chunk)

        # End segment
        output_emitter.end_segment()
```
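For the flush-timing guidance in point 3, the piece the complete example above leaves out is how the incoming text is consumed. Below is a minimal sketch of that input loop, not a definitive implementation: it assumes the base class exposes `self._input_ch` yielding text tokens and `_FlushSentinel` markers (as the stock plugins do), and it uses a hypothetical `self._stream_from_baseten(text)` helper for the provider's HTTP audio stream. Text is buffered until roughly 50 characters are available, synthesis starts early, one segment stays open across inputs, and the segment is only closed when a flush sentinel arrives.

```python
from livekit.agents import tts, utils

MIN_CHARS_TO_START = 50  # assumed threshold; tune for your provider


class SynthesizeStream(tts.SynthesizeStream):
    async def _run(self, output_emitter: tts.AudioEmitter) -> None:
        output_emitter.initialize(
            request_id=utils.shortuuid(),
            sample_rate=self._sample_rate,
            num_channels=1,
            mime_type="audio/pcm",
            stream=True,
        )

        buffer = ""
        segment_open = False

        async for data in self._input_ch:
            if isinstance(data, self._FlushSentinel):
                # Flush: synthesize whatever text is left, then close the segment.
                if buffer:
                    await self._synthesize_into(buffer, output_emitter, segment_open)
                    segment_open = True
                    buffer = ""
                if segment_open:
                    output_emitter.end_segment()
                    segment_open = False
                continue

            buffer += data
            # Start synthesizing early instead of waiting for the flush sentinel.
            if len(buffer) >= MIN_CHARS_TO_START:
                await self._synthesize_into(buffer, output_emitter, segment_open)
                segment_open = True
                buffer = ""

    async def _synthesize_into(
        self, text: str, output_emitter: tts.AudioEmitter, segment_open: bool
    ) -> None:
        # One segment per logical response; many audio chunks pushed inside it.
        if not segment_open:
            output_emitter.start_segment(segment_id=utils.shortuuid())
        # Hypothetical provider helper that streams raw PCM chunks over HTTP.
        async for audio_chunk in self._stream_from_baseten(text):
            output_emitter.push(audio_chunk)
```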
This approach ensures proper audio frame timing, segment boundaries, and compatibility with LiveKit's audio processing pipeline. Let me know if you need further clarification or additional examples!

Sources: livekit-plugins/livekit-plugins-google/livekit/plugins/google/tts.py
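As for getting the synthesized audio into the room: you normally don't publish frames yourself; the agent session does it once it is given a TTS instance. Here is a minimal sketch, assuming a hypothetical `BasetenTTS` class (and `my_baseten_plugin` module) that wraps the stream above; the entrypoint and worker wiring follow the standard LiveKit Agents 1.x pattern:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai  # any LLM plugin you already use

# Hypothetical tts.TTS subclass whose stream() returns the SynthesizeStream above.
from my_baseten_plugin import BasetenTTS


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=BasetenTTS(),  # the session pulls audio from the TTS stream
        # add stt/vad here if the agent also needs to listen to users
    )

    # Starting the session attaches the agent's audio output to the room,
    # so every segment/chunk emitted by the TTS stream is played to participants.
    await session.start(room=ctx.room, agent=Agent(instructions="You are a helpful assistant."))


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```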