# ask-ai
w
What's the order of operations for when the agent says something when performing a `function_tool` call? I'm using `generate_reply` within my tool calls, but I'm getting lots of double-speaking when doing that. My assumption is that I first say something, the agent decides to do a tool call, calls the tool, everything in the tool executes, the tool call returns, and then the agent generates speech. Is this accurate?
t
Thanks for reaching out, Alex. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
The way LiveKit handles agent speech and tool calls is set up to keep things smooth and avoid any double-speaking by taking things one step at a time:

1. LLM Inference: First, the LLM comes up with a response, which might include text and function calls. If there's text, it goes to TTS to create audio, and any function calls get lined up for the next step.
2. Tool Execution: Once the LLM's response is done, the queued function calls are carried out. Any output from these tools gets added back into the chat to keep the conversation flowing naturally.
3. Tool Reply Handling: If a tool's output needs a reply, the chat updates with that info. Then, a new `generate_reply` runs with `tool_choice` set to `"none"` so no extra tools run, making sure we don't end up talking over ourselves.

This way, everything happens in order, keeping the conversation clear and easy to follow without any repeated or overlapping speech.
w
So the order is:

- User speaks
- LLM generates both reply and tool call
- Reply is passed to TTS
- Tool call is executed
- Tool call response is returned to the LLM
- New reply is generated from the LLM?
t
The sequence of operations in LiveKit goes like this:

1. User Input: We start by capturing and transcribing the user's audio input. Once the user finishes speaking, the agent's `generate_reply` method kicks in with the user's message.
2. LLM Reply Generation: The `generate_reply` method creates a `SpeechHandle` and begins the LLM inference. The LLM then crafts a response, which might include both text and tool calls.
3. Reply to TTS: The text part of the LLM's response is sent over to TTS to generate the audio.
4. Tool Execution: If there are any tool calls in the LLM's response, they get executed. The results from these tools are added back into the chat context.
5. Tool Call Response Handling: When the tool execution needs a reply, the system uses those tool results to create a fresh LLM response.
6. Final Reply: This new LLM response is turned into speech and played back to the user.

This flow keeps everything smooth and in order, so you won't run into issues like talking over yourself.

Sources: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/llm.py
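If the reason you were calling `generate_reply` inside the tool was to say something while a slow operation runs, one option is `session.say`, which speaks a fixed string over TTS without another LLM inference, so it doesn't compete with the automatic tool reply in step 5. A sketch under the same assumptions (hypothetical tool, simulated slow call):

```python
import asyncio

from livekit.agents import RunContext, function_tool


@function_tool
async def search_knowledge_base(context: RunContext, query: str) -> str:
    """Search the knowledge base for an answer."""
    # say() plays fixed text with no extra LLM inference; the returned
    # SpeechHandle can be awaited if you need to wait for playout.
    context.session.say("One moment while I look that up.")

    await asyncio.sleep(2)  # hypothetical slow backend call

    # The returned string feeds step 5: one generated reply, no overlap.
    return f"Top result for {query!r}: ..."
```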
You're absolutely right! Sometimes, the LLM in LiveKit might generate just a tool call without an initial reply. Here's a quick rundown of how it works:

1. The LLM spots when a tool call is needed and creates it without giving an immediate response.
2. The tool runs, and its result comes back to the system.
3. That result gets added to the chat context, prompting the LLM to respond next, now including the tool's output.

This way, the LLM's reply is spot-on and based on the freshest info from the tool's execution.

Sources: Tool definition and use | LiveKit Docs | livekit-agents/livekit/agents/voice/agent_activity.py
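To make the tool-call-only turn concrete, here's the conceptual shape of the chat context as it evolves. This is illustrative pseudodata in the OpenAI-style chat format, not LiveKit's actual `ChatContext` types:

```python
turn = [
    {"role": "user", "content": "What's my account balance?"},
    # 1. The LLM emits only a tool call, no assistant text (nothing goes to TTS yet):
    {"role": "assistant", "tool_calls": [{"name": "get_balance", "arguments": "{}"}]},
    # 2. The tool runs and its output is appended:
    {"role": "tool", "name": "get_balance", "content": "$42.17"},
    # 3. A second inference (tool_choice="none") produces the single spoken reply:
    {"role": "assistant", "content": "Your balance is $42.17."},
]
```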