# ask-ai
w
How would I go about making a TTS node that only works via session.say (i.e. does not auto-say the LLM output)? Or would it be easier to make a custom LLM node that does not stream its output into the TTS node?
t
Thanks for reaching out, Drew. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
To create a TTS node that only works via `session.say()` and does not auto-say LLM output, you can override the `tts_node` method in your Agent class. This allows you to control when the TTS node processes text, ensuring it only triggers through explicit calls to `session.say()`. Here's a high-level approach (a sketch follows the list):

1. Extend the Agent class: create a subclass of `Agent` and override the `tts_node` method.
2. Modify the TTS node: in the overridden `tts_node`, add logic to process text only when it was explicitly requested via `session.say()`.
3. Configure the agent: set up your Agent with the necessary components (STT, LLM, TTS, and VAD).
4. Implement session control: use `session.say()` to trigger TTS output and `session.generate_reply()` for LLM responses that should not be spoken automatically.

This approach gives you fine-grained control over when the TTS node processes text, allowing you to separate manual and automatic responses in your Agent. Let me know if you need further details or code examples!
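For example, here's a minimal sketch of that approach. It assumes the livekit-agents 1.x node-override API (where `Agent.default.tts_node()` gives you the default synthesis behavior) and that `session.say()` routes through the agent's `tts_node` in your version; the `say_manually()` helper and the `_tts_enabled` flag are just illustrative names, and import paths can vary slightly between releases:

```python
from collections.abc import AsyncIterable

from livekit import rtc
from livekit.agents import Agent, ModelSettings


class ManualSpeechAgent(Agent):
    """Only synthesizes speech for text that was explicitly requested."""

    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful assistant.")
        self._tts_enabled = False  # gate: only speak when explicitly allowed

    async def say_manually(self, text: str) -> None:
        # Illustrative helper: open the gate, speak via session.say(), then close it.
        self._tts_enabled = True
        try:
            handle = self.session.say(text)
            await handle  # SpeechHandle is awaitable; waits for playback to finish
        finally:
            self._tts_enabled = False

    async def tts_node(
        self, text: AsyncIterable[str], model_settings: ModelSettings
    ) -> AsyncIterable[rtc.AudioFrame]:
        if not self._tts_enabled:
            # Drain the incoming text without producing audio, so auto-generated
            # LLM replies are never spoken.
            async for _ in text:
                pass
            return
        # Gate is open: fall back to the default synthesis behavior.
        async for frame in Agent.default.tts_node(self, text, model_settings):
            yield frame
```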
w
what is the high level approach to implement an LLM node that does not stream its output to the tts node
t
If you want to set up an LLM node that doesn't stream its output to the TTS node, a great way to do this is by overriding the `llm_node` method in your custom Agent class. This lets you catch and tweak the LLM output before it reaches the TTS node, or even stop it from streaming altogether. Here's a simple way to think about it (a sketch follows the list):

1. Create a custom Agent class: start by subclassing the base `Agent` class.
2. Override the `llm_node` method: this is where you can intercept the LLM output stream.
3. Modify or process the output: adjust the LLM output however you need before yielding it.
4. Prevent streaming to TTS: after you're done processing, raise a `StopResponse` exception to stop the output from going to the TTS node.

This way, you have full control over what the LLM outputs and can make sure it doesn't automatically stream to the TTS node. If you'd like, I'm happy to share more details or some example code to help you out!
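As a starting point, here's a rough sketch of that idea. It assumes the livekit-agents 1.x signatures for `llm_node` and that raising `StopResponse` from inside the node ends the turn before anything reaches TTS; the `last_reply` attribute is just an illustrative place to keep the unspoken text, and the exact `ChatChunk` fields may differ in your version:

```python
from livekit.agents import Agent, ModelSettings, StopResponse, llm


class TextOnlyAgent(Agent):
    """Runs the LLM but never streams its output to the TTS node."""

    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful assistant.")
        self.last_reply: str = ""  # illustrative: keep the unspoken reply around

    async def llm_node(
        self,
        chat_ctx: llm.ChatContext,
        tools: list[llm.FunctionTool],
        model_settings: ModelSettings,
    ):
        # Run the default LLM inference, but collect the chunks instead of yielding them.
        parts: list[str] = []
        async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
            # Chunks may arrive as plain strings or ChatChunk objects depending on setup.
            if isinstance(chunk, str):
                parts.append(chunk)
            elif chunk.delta and chunk.delta.content:
                parts.append(chunk.delta.content)

        self.last_reply = "".join(parts)
        # ...process the text here (log it, send it over a data channel, etc.)

        # Ending the turn here means nothing is handed to the TTS node.
        raise StopResponse()
```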
w
what's the difference between Agent and AgentSession
t
Hey there! Just wanted to break down the difference between the Agent and AgentSession in LiveKit for you:

• Agent: This is where the AI's personality and smarts live. It includes all the tools, instructions, and functions that shape how the agent chats with users and gets things done.
• AgentSession: Think of this as the session's manager. It kicks things off and keeps everything running smoothly by handling components like STT, TTS, VAD, and LLM, plus managing the connection to the room and media streams.

So, simply put, the Agent is all about what the agent does, while the AgentSession looks after how everything works during a session. If you want me to explain anything more, just give me a shout!

Sources: livekit-agents/livekit/agents/voice/__init__.py
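Here's a minimal sketch of how the two fit together, loosely based on the standard voice-agent quickstart (the specific plugins and model names here are just examples; swap in whatever you're using):

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero


# The Agent defines what the assistant is: its instructions, tools, and behavior.
class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a friendly voice assistant.")


# The AgentSession defines how it runs: it wires STT, LLM, TTS, and VAD together
# and manages the room connection and media streams for the session.
async def entrypoint(ctx: agents.JobContext) -> None:
    await ctx.connect()
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
    )
    await session.start(agent=Assistant(), room=ctx.room)


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```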
w
what is the difference between say and generate_reply methods?
t
Hey there! Just wanted to help clarify the difference between the `say` and `generate_reply` methods in LiveKit:

• `say()`: This one's great for predefined messages. It takes your text and turns it into speech using TTS or plays pre-recorded audio. Perfect for those steady, reliable responses like greetings or notifications.
• `generate_reply()`: This method is a bit more dynamic: it creates responses on the fly using the Language Model (LLM) based on what the user says or instructions you give, then converts that response to speech using TTS or a realtime model. It's awesome for interactive, context-aware chats.

Both will give you a `SpeechHandle` object so you can keep track of what your agent is saying. So, think of `say()` for fixed messages and `generate_reply()` when you want something more flexible and conversational. If you want me to explain anything else, just let me know; I'm here to help!
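Here's a tiny usage sketch, assuming an `AgentSession` that has already been started (awaiting the returned `SpeechHandle` is optional; it just waits for playback to finish):

```python
from livekit.agents import AgentSession


async def greet(session: AgentSession) -> None:
    # say(): speak a fixed, predefined message directly via TTS.
    handle = session.say("Thanks for calling! How can I help you today?")
    await handle  # optionally wait until playback finishes

    # generate_reply(): let the LLM compose the reply, then speak it via TTS.
    handle = session.generate_reply(
        instructions="Ask the caller what they need help with, in one short sentence."
    )
    await handle
```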
w
show me in the code where the llm output is streamed into the tts.
t
Just to give you a quick overview: the LLM output is streamed into the TTS inside the `perform_tts_inference` function found in `livekit/agents/voice/generation.py`. Here's the flow:

1. LLM output generation: the `perform_llm_inference` function takes your input, runs it through the LLM, and sends text chunks over the `text_ch` channel.
2. TTS processing: the `perform_tts_inference` function picks up that text stream from `text_ch` and uses the TTS node to turn it into audio.
3. Audio frame streaming: the audio frames get sent through the `audio_ch` channel so they can be played back.

If you want me to walk you through any part in more detail or share some code snippets, just let me know; I'm happy to help!
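This isn't the actual library code, but here's a small self-contained asyncio sketch of that same channel pattern (the LLM stage produces text into a text channel, the TTS stage consumes it and produces audio into an audio channel); the real `generation.py` uses LiveKit's own channel and node abstractions:

```python
import asyncio


async def llm_stage(text_ch: asyncio.Queue) -> None:
    # Stand-in for perform_llm_inference: push text chunks onto the text channel.
    for chunk in ["Hello ", "from ", "the ", "LLM."]:
        await text_ch.put(chunk)
    await text_ch.put(None)  # sentinel: text stream finished


async def tts_stage(text_ch: asyncio.Queue, audio_ch: asyncio.Queue) -> None:
    # Stand-in for perform_tts_inference: read text, "synthesize", emit audio frames.
    while (chunk := await text_ch.get()) is not None:
        await audio_ch.put(f"<audio frame for {chunk!r}>")  # pretend synthesis
    await audio_ch.put(None)  # sentinel: audio stream finished


async def playback(audio_ch: asyncio.Queue) -> None:
    # Stand-in for the playout path that consumes audio_ch.
    while (frame := await audio_ch.get()) is not None:
        print(frame)


async def main() -> None:
    text_ch: asyncio.Queue = asyncio.Queue()
    audio_ch: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(
        llm_stage(text_ch),
        tts_stage(text_ch, audio_ch),
        playback(audio_ch),
    )


if __name__ == "__main__":
    asyncio.run(main())
```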