How to have a different handling for voice message...
# ask-ai
b
How to handle voice messages and text messages differently? Code examples for the agent and client app?
t
Thanks for reaching out, Rongrong. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
I'd love to better understand what you're hoping to achieve when handling voice and text messages differently. Could you share a bit more about your goals? For example, are you interested in:
• How to detect and tell apart voice and text messages?
• Ways to process or store these messages differently?
• How to show them differently within a client application?
Also, it would be helpful to know if you're looking for examples in a particular programming language or framework. I'm here to help you every step of the way!
b
I would like the agent to handle voice messages through an STT-LLM-TTS pipeline, and text messages by calling the LLM directly and returning text. How do I implement this?
t
I'm happy to guide you through handling voice messages with an STT-LLM-TTS pipeline and text messages by calling an LLM directly, using the `VoicePipelineAgent` class from the LiveKit Agents framework. (Note: `VoicePipelineAgent` is the Agents 0.x API; the newer 1.x framework uses `Agent` and `AgentSession`.) Let's walk through the steps together:
1. Initialize the VoicePipelineAgent
First, define your agent with STT, LLM, and TTS models. Here's an example to get you started:
```python
# Example using Groq-based STT and LLM with OpenAI TTS (Agents 0.x)
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import openai, silero

agent = VoicePipelineAgent(
    vad=silero.VAD.load(),  # Optional, but very helpful for managing interruptions
    stt=openai.STT.with_groq(),
    llm=openai.LLM.with_groq(model="llama3-70b-8192"),
    tts=openai.TTS(),
    chat_ctx=initial_context,
    allow_interruptions=True,  # Lets users interrupt the agent's speech naturally
)
```
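The snippet above references an `initial_context` that isn't defined in the thread. In the 0.x API it is typically built from `llm.ChatContext`; a minimal sketch (the system prompt text is illustrative):

```python
from livekit.agents import llm

# Hypothetical system prompt; adjust to your use case
initial_context = llm.ChatContext().append(
    role="system",
    text="You are a helpful voice assistant. Keep your answers short and conversational.",
)
```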
2. Handle Voice Messages
When someone speaks, the agent processes their audio through the STT-LLM-TTS pipeline. You can start the agent in the room like this (a fuller entrypoint sketch follows the snippet):
```python
agent.start(room, participant)
```
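For context, here is a hedged sketch of where `room` and `participant` usually come from: the agent's entrypoint connects to the room and waits for a participant before starting the pipeline (0.x-style; adapt to your worker setup):

```python
from livekit.agents import AutoSubscribe, JobContext

async def entrypoint(ctx: JobContext):
    # Connect to the room, subscribing to audio only
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # Wait for the first remote participant, then start the voice pipeline for them
    participant = await ctx.wait_for_participant()
    agent.start(ctx.room, participant)
```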
3. Handle Text Messages
For text messages you can skip the STT step and call the LLM directly with the text input, then return the reply as text. Note that `agent.say()` below speaks a scripted line through TTS rather than calling the LLM; a sketch of the direct-LLM approach follows the snippet:
```python
await agent.say("Hello! How can I assist you?")
```
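To actually answer incoming text with the LLM and reply in text (no STT or TTS), one 0.x-style pattern is to listen for chat messages on the room and feed them to the agent's LLM. A hedged sketch, assuming the `rtc.ChatManager` helper and the 0.x streaming `llm.chat()` interface; `room` and `agent` mirror the snippets above, so adapt the names to your setup:

```python
import asyncio
from livekit import rtc

chat = rtc.ChatManager(room)

async def answer_from_text(user_text: str):
    # Build a chat context from the incoming text and stream the LLM reply
    chat_ctx = agent.chat_ctx.copy()
    chat_ctx.append(role="user", text=user_text)
    stream = agent.llm.chat(chat_ctx=chat_ctx)

    reply = ""
    async for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            reply += content

    # Return the reply as text instead of synthesizing speech
    await chat.send_message(reply)

@chat.on("message_received")
def on_chat_received(msg: rtc.ChatMessage):
    if msg.message:
        asyncio.ensure_future(answer_from_text(msg.message))
```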
4. Customize Pipeline Behavior
Feel free to tailor the pipeline to your needs by overriding nodes (see the sketch below):
• STT Node: Pre-process or post-process the transcription to fit your use case.
• LLM Node: Modify inputs or outputs before or after inference for more control.
• TTS Node: Adjust the text before it's synthesized.
5. Additional Features
• Use lifecycle hooks like `on_user_turn_completed` to manage context effectively.
• Keep interruption handling smooth by enabling `allow_interruptions=True`.
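As a concrete illustration of a node override: note that the node API belongs to the newer `Agent` class (Agents 1.x) rather than `VoicePipelineAgent`, so treat this as a minimal sketch under that assumption; `CustomAgent` and the pass-through logic are illustrative only:

```python
from typing import AsyncIterable

from livekit.agents import Agent, ModelSettings, llm

class CustomAgent(Agent):
    async def llm_node(
        self,
        chat_ctx: llm.ChatContext,
        tools: list[llm.FunctionTool],
        model_settings: ModelSettings,
    ) -> AsyncIterable[llm.ChatChunk]:
        # Pre-process here, e.g. inspect or trim chat_ctx before inference
        async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
            # Post-process each streamed chunk before it continues to TTS
            yield chunk
```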
Please don't hesitate to ask if you'd like more details or specific code examples; I'm here to help!
Sources: AI voice agents | LiveKit Docs | Pipeline nodes and hooks | LiveKit Docs | VoicePipelineAgent | LiveKit Docs | chain-of-thought-tts/agent.py
b
Can we skip the TTS node at a per-message level?
t
Absolutely, you can skip the TTS node at a per-message level in LiveKit's voice pipeline. It's straightforward using the `on_user_turn_completed` lifecycle hook. Here's how you can do it:
Python Example:
```python
from livekit.agents import Agent, StopResponse
from livekit.agents.llm import ChatContext, ChatMessage

class MyAgent(Agent):
    async def on_user_turn_completed(
        self, turn_ctx: ChatContext, new_message: ChatMessage,
    ) -> None:
        if not new_message.text_content:
            raise StopResponse()  # Skips LLM and TTS for this turn
```
Node.js Example:
```typescript
import { llm, voice } from '@livekit/agents';

class MyAgent extends voice.Agent {
  async onUserTurnCompleted(
    turnCtx: llm.ChatContext,
    newMessage: llm.ChatMessage,
  ): Promise<void> {
    if (!newMessage.textContent) {
      throw new voice.StopResponse(); // Skips LLM and TTS for this turn
    }
  }
}
```
This gives you per-message control over whether the agent generates a response at all. If anything feels unclear or you'd like more examples, please don't hesitate to ask; I'm here to help!
Sources: Docs::Agents | Pipeline nodes and hooks | LiveKit Docs
b
But this example skips both LLM and TTS. Can we skip the TTS node only?
t
To skip just the TTS node while still letting the LLM run, you can override the `tts_node` method in a custom `Agent` subclass. Here's an example:
Python Example:
```python
from typing import AsyncIterable

from livekit.agents import Agent, ModelSettings

class TTSSkippingAgent(Agent):
    async def tts_node(
        self, text: AsyncIterable[str], model_settings: ModelSettings
    ) -> None:
        # Skip TTS processing by returning None
        return None
```
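For completeness, a hedged sketch of plugging such an agent into a 1.x `AgentSession`; the plugin choices and the instructions string are illustrative, not from the thread:

```python
from livekit.agents import AgentSession, JobContext
from livekit.plugins import deepgram, openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect()
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),  # still configured, but tts_node above never produces audio
    )
    await session.start(
        agent=TTSSkippingAgent(instructions="You are a helpful assistant."),
        room=ctx.room,
    )
```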
This way, the LLM will continue to process and generate text as expected, but the TTS step is skipped, so no audio is created.
Another option: if returning `None` doesn't fit your setup, you can return an empty asynchronous generator instead, like this:
```python
from livekit import rtc

# Inside the same Agent subclass as above
async def tts_node(
    self, text: AsyncIterable[str], model_settings: ModelSettings
) -> AsyncIterable[rtc.AudioFrame]:
    async def empty_stream() -> AsyncIterable[rtc.AudioFrame]:
        return
        yield  # unreachable, but makes this an async generator that yields nothing
    return empty_stream()
```
Please feel free to reach out if you'd like me to explain anything further or if you want more examples; I'm here to help!
Sources: Pipeline nodes and hooks | LiveKit Docs | examples/voice_agents/structured_output.py | Agents 0.x migration guide | LiveKit Docs | Agents v0.x migration guide - Node.js | LiveKit Docs
b
How to add a listener for `user_state_changed`?
t
I'd be happy to help you add a listener for `user_state_changed` in LiveKit! You can do this by using the `AgentSession` object and attaching an event listener. Here's a simple way to implement it:
Node.js Example:
```typescript
import { voice } from '@livekit/agents';

session.on(voice.AgentSessionEventTypes.UserStateChanged, (ev) => {
  if (ev.newState === 'speaking') {
    console.log('User started speaking');
  } else if (ev.newState === 'listening') {
    console.log('User stopped speaking');
  } else if (ev.newState === 'away') {
    console.log('User is not present (e.g. disconnected)');
  }
});
```
Python Example:
```python
from livekit.agents import UserStateChangedEvent

@session.on("user_state_changed")
def on_user_state_changed(ev: UserStateChangedEvent):
    if ev.new_state == "speaking":
        print("User started speaking")
    elif ev.new_state == "listening":
        print("User stopped speaking")
    elif ev.new_state == "away":
        print("User is not present (e.g. disconnected)")
```
User States:
• speaking: VAD detected that the user has started speaking.
• listening: VAD detected that the user has stopped speaking.
• away: The user hasn't responded for a while (default: 15 seconds).
Also, if you'd like, you can customize the timeout for the "away" state using the `user_away_timeout` parameter in `AgentSession`.
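For example, a minimal sketch of raising that timeout; the plugin choices are placeholders:

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai, silero

session = AgentSession(
    vad=silero.VAD.load(),
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    user_away_timeout=30.0,  # mark the user "away" after 30 s of silence (default: 15 s)
)
```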
Please feel free to reach out if you have any questions or need more help; I'm here for you!
Sources: Events and error handling | LiveKit Docs | livekit-agents/livekit/agents/voice/events.py
b
How to disconnect from the agent side?
t
Hi there! To disconnect an agent from the room in LiveKit, call the `ctx.shutdown()` method from within the agent's entrypoint function. The agent disconnects cleanly while the other participants continue their session without interruption.
Python Example:
```python
from livekit.agents import JobContext

async def entrypoint(ctx: JobContext):
    # Perform agent tasks...

    # Disconnect the agent from the room
    ctx.shutdown(reason="Session ended")
```
Node.js Example:
```typescript
import { defineAgent, type JobContext } from '@livekit/agents';

export default defineAgent({
  entry: async (ctx: JobContext) => {
    // Perform agent tasks...

    // Disconnect the agent from the room
    ctx.shutdown('Session ended');
  },
});
```
Once you call `ctx.shutdown()`, any registered shutdown hooks will run to handle cleanup tasks like saving state or releasing resources (a sketch of registering one is below). If you have any questions or need a hand with anything else, just let me know; I'm here to help!
Sources: Inside a session | LiveKit Docs | Job lifecycle | LiveKit Docs
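To illustrate those shutdown hooks, a minimal Python sketch using `add_shutdown_callback`; the `save_state` coroutine is a placeholder for your own cleanup logic:

```python
from livekit.agents import JobContext

async def save_state():
    # Placeholder cleanup: persist conversation history, flush logs, release resources, etc.
    pass

async def entrypoint(ctx: JobContext):
    # Register the hook early; it runs when the job shuts down (including after ctx.shutdown())
    ctx.add_shutdown_callback(save_state)

    # ... agent work ...

    ctx.shutdown(reason="Session ended")
```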