# ask-ai
**User:** can I stream incoming llm response directly to tts?
**AI:** Thanks for reaching out, Akshay. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
**User:** ??
**AI:** Absolutely, you can stream incoming LLM responses directly to TTS in LiveKit for real-time audio synthesis. The LiveKit Agents framework supports this with both an easy high-level API and fine-grained manual control, producing smooth, natural-sounding voice output with low latency. There are two main ways to do it:

- **Using `AgentSession`:** The go-to option for most situations. Set up an `AgentSession` with a TTS instance, and LiveKit takes care of streaming the LLM's text output to the TTS engine in real time. It's straightforward and requires very little setup.
- **Standalone TTS streaming:** If you're looking for more hands-on control, you can manually stream LLM responses to a TTS engine. You manage how text chunks are sent and keep full control over the synthesis process.

If you'd like detailed examples or have any questions along the way, just let me know. I'm here to help!
**User:** how to add it in code?
**AI:** I'd love to help you add streaming of LLM responses to TTS in your code! You have two good options: `AgentSession` for an easy, automatic setup, or the manual route if you want more control. Here are examples of both:

- **Using `AgentSession` (recommended):** This is the simplest way, as it automatically streams LLM responses to TTS:
```python
from livekit.agents import AgentSession
from livekit.plugins import openai, silero

# Initialize the session
session = AgentSession(
    vad=silero.VAD.load(),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=openai.TTS(model="tts-1")
)

# Start the session in a room (YourAgent is a placeholder for your own Agent subclass)
await session.start(agent=YourAgent(), room=ctx.room)
```

With this setup, the LLM's output is streamed to the TTS engine as it is generated, and the synthesized audio reaches every participant in the room. It's really convenient!
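For context, here is a minimal sketch of the surrounding worker code the snippet above assumes, based on the standard livekit-agents entrypoint pattern (assuming a recent 1.x release); the `Assistant` class, its instructions, and the model names are illustrative placeholders:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai, silero

class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice assistant.")

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        vad=silero.VAD.load(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=openai.TTS(model="tts-1"),
    )
    # Starting the session wires the LLM's streamed text into TTS for you
    await session.start(agent=Assistant(), room=ctx.room)

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```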
- **Manual TTS streaming (for custom control):** If you want to tailor the streaming process yourself, here's how you can manually stream text to TTS and publish the audio:

```python
import asyncio

from livekit import rtc
from livekit.plugins import cartesia

# Initialize TTS and audio source
tts = cartesia.TTS(model="sonic-english")
audio_source = rtc.AudioSource(tts.sample_rate, tts.num_channels)
track = rtc.LocalAudioTrack.create_audio_track("agent-audio", audio_source)
await ctx.room.local_participant.publish_track(track)

# Stream text to TTS
tts_stream = tts.stream()

async def send_audio():
    async for audio in tts_stream:
        await audio_source.capture_frame(audio.frame)

# Keep a reference to the task so it isn't garbage collected mid-stream
send_audio_task = asyncio.create_task(send_audio())

# Push text chunks to TTS
for chunk in ["Hello ", "from LiveKit. ", "How can I help you?"]:
    tts_stream.push_text(chunk)
tts_stream.end_input()
```

This method gives you full control over the text flow and lets you add any custom logic you might need. If you'd like a full working example or a walkthrough of any part of this, just let me know!
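And to close the loop on your original question: the chunks you push don't have to be hard-coded strings; they can come straight from a streaming LLM response. Here's a hypothetical sketch using the OpenAI Python SDK as the LLM client, where the model name and prompt are placeholders and `tts_stream` is the stream from the example above:

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def speak_llm_response(tts_stream, prompt: str):
    # Request a streamed chat completion so text arrives token by token
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in response:
        delta = chunk.choices[0].delta.content
        if delta:
            # Forward each text delta to TTS immediately, so synthesis
            # starts well before the LLM has finished responding
            tts_stream.push_text(delta)
    # Signal end of input so the TTS engine can flush its buffer
    tts_stream.end_input()
```

You would call `await speak_llm_response(tts_stream, "your prompt here")` in place of the hard-coded loop above.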