# ask-ai
n
In LiveKit, I have a multimodal app where I send a screenshot + some text. I need to return text/audio, but I also have a couple of other async LLM call results I need to return based on the message sent in. How would this work?
t
Thanks for reaching out, rohit. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
I'd love to understand a bit more about how you're currently managing the multimodal input (screenshot and text) in your app. For instance:
• Are you using LiveKit's APIs for this, or have you set up this functionality outside of LiveKit?
• How do you plan to process and return the results from the asynchronous LLM calls? Are these results connected to specific LiveKit features?
Getting a clearer picture will really help me offer you the best guidance possible.
n
Currently they happen on a Rust backend, but I'm looking to migrate them into LiveKit. Where would I implement it?
t
I'm glad you're interested in migrating your multimodal input processing and asynchronous LLM calls into LiveKit! A great way to do this is with the LiveKit Agents framework, an open-source framework designed for building real-time, multimodal AI applications. It offers backend SDKs in Python and Node.js (the latter still in development) that let you create programmable participants capable of handling audio, video, and text streams in real time. Here's how you can get started:
• Multimodal input: Use the stream-processing hooks in the Agent SDKs to handle audio, text, and video inputs. For instance, the `pushAudio` method lets you send resampled audio frames to your model, while text inputs can be processed directly.
• Asynchronous processing: The framework supports fully asynchronous pipelines built on Python `asyncio`. You can process media streams, generate outputs such as transcription or TTS, and handle events like `response_audio_delta` or `response_text_done` for low-latency updates.
• Custom logic: You can override built-in processing nodes (such as the transcription or TTS nodes) to tailor the backend logic, and you can manage room state and interact with participants through the `JobContext`.
For example, you might create an agent that processes both audio and text inputs, generates spoken responses, and updates its configuration dynamically via RPC. This setup lets you handle multimodal input and return results asynchronously within LiveKit. If you'd like more details or examples, just ask; a minimal worker sketch is included below.
Sources: An open source stack for real-time multimodal AI | plugins/google/src/beta/realtime/realtime_api.ts | livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/realtime/real... | agent/main.py | tools/server/server.cpp
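For orientation, here is a minimal sketch of what an Agents worker entrypoint typically looks like in the Python SDK. The provider plugins (Deepgram, OpenAI, Silero) and model names are placeholder choices rather than requirements, and exact imports can vary between SDK versions:
```python
from livekit import agents
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
    # Connect the agent (a programmable participant) to the room
    await ctx.connect()

    # Wire up the voice pipeline: STT -> LLM -> TTS, with VAD for turn detection
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=openai.TTS(),
        vad=silero.VAD.load(),
    )

    await session.start(
        agent=Agent(instructions="You are a helpful multimodal assistant."),
        room=ctx.room,
    )


if __name__ == "__main__":
    agents.cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
From here, incoming text or image payloads and any extra LLM calls can be handled in the same process as the audio pipeline.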
n
Do you have an example / recipe that shows multiple returns happening? So we are doing STT + returning a couple of results async from other LLM calls.
t
I'm glad you're exploring how to handle STT plus multiple asynchronous LLM call results in LiveKit! A good way to approach this is with the LiveKit Agents framework, which is built around asynchronous pipelines and parallel tool execution. There isn't a single example that combines all of these elements, but here is a practical recipe to get started:
1. Define tools: Create function tools the LLM can call, such as a weather lookup or an alarm-setting function. These tools run asynchronously.
2. Configure the AgentSession: Set up your session with STT, LLM, and TTS providers. The framework takes care of streaming audio to STT, processing the final transcripts, and forwarding them to the LLM.
3. Enable parallel tool calls: The LLM node supports running multiple tool calls in parallel; you define the tools and the framework manages their concurrent execution.
4. Customize if needed: You can override default nodes such as `llm_node` to add custom logic for handling tool calls or responses.
The Multi-agent Handoff example shows how STT, LLM, and TTS work together, and the Function Calling Test Plugin is a useful reference for executing multiple parallel tool calls with an LLM. If you'd like, I can share specific code snippets or walk you through setting up a similar pipeline; a sketch of step 1 follows below.
Sources: plugins/test/src/llm.ts | README.md
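To make step 1 concrete, here is a hedged sketch of two function tools defined on a custom agent, assuming the 1.x Python SDK's `function_tool` decorator. The tool names (`get_weather`, `set_alarm`) and their bodies are placeholders, not part of the framework:
```python
from livekit.agents import Agent, function_tool


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You can check weather and set alarms.")

    @function_tool()
    async def get_weather(self, location: str) -> str:
        """Look up the current weather for a location."""
        # Placeholder: call your real weather service here
        return f"It is sunny in {location}."

    @function_tool()
    async def set_alarm(self, time: str) -> str:
        """Set an alarm for the given time."""
        # Placeholder: call your real scheduling backend here
        return f"Alarm set for {time}."
```
The docstrings are what the LLM sees as tool descriptions, so they should describe when each tool applies.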
n
Is there a way to do this without function calling?
t
I'm happy to guide you through handling STT and asynchronous LLM results in LiveKit without using function calling. You can set up the agent pipeline to process audio and text directly:
1. STT processing: Use the `stt_node` to convert audio streams into text. You can customize this node, for example by adding noise filtering before processing or adjusting the text output afterward. The default `stt_node` handles streaming audio and includes Voice Activity Detection (VAD) for non-streaming STT engines.
2. LLM processing: Use the `llm_node` to generate text responses. To disable function calling, set `tool_choice = "none"` in the `ModelSettings`. The LLM then focuses solely on generating text without trying to call any external tools.
3. Integration workflow:
   ◦ Stream your audio into the `stt_node` and gather the transcriptions.
   ◦ Send the final transcript to the `llm_node` as a user message wrapped in a `ChatContext`.
   ◦ Take the LLM's text output and use it for whatever comes next, such as text-to-speech (TTS) or other processing.
Here's a simple example showing how to disable function calls in the `llm_node`:
```python
# Inside your custom Agent subclass:
async def llm_node(self, chat_ctx, tools, model_settings):
    # Disable tool/function calling for this turn
    model_settings.tool_choice = "none"
    # Delegate to the default implementation and stream its output through
    async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
        yield chunk
```
This approach keeps your STT and LLM interactions purely text-based, so no function calls are triggered. If you'd like help with any part of the setup, just let me know; a rough sketch of the transcript-to-LLM step follows below.
Sources: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py | Pipeline nodes and hooks | LiveKit Docs
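To illustrate the middle step, here is a rough sketch of running a finished transcript through an LLM plugin with a `ChatContext`, outside the voice pipeline. It assumes the 1.x Python API (`ChatContext.add_message`, `LLM.chat`, chunk `delta.content`); exact names can differ between versions, and the helper `run_llm_on_transcript` is hypothetical:
```python
from livekit.agents.llm import LLM, ChatContext


async def run_llm_on_transcript(llm: LLM, transcript: str) -> str:
    # Wrap the final STT transcript as a user message
    chat_ctx = ChatContext()
    chat_ctx.add_message(role="user", content=transcript)

    # Stream the completion and collect the text deltas
    parts: list[str] = []
    stream = llm.chat(chat_ctx=chat_ctx)
    async for chunk in stream:
        if chunk.delta and chunk.delta.content:
            parts.append(chunk.delta.content)
    return "".join(parts)
```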
n
Can I do multiple LLM calls in `llm_node`?
t
Yes, within LiveKit the `llm_node` can issue multiple tool calls at once by enabling parallel execution. Here's how to make this happen:
1. Enable parallel tool calls: Set `parallelToolCalls: true` in your LLM configuration. This lets the LLM call several tools at the same time based on what the user asks.
2. Define tools: Use a `ToolContext` to register the functions you want to call, with each tool's parameters defined with Zod for type safety. For instance, you could have tools like `getWeather` and `setAlarm`.
3. Stream responses: While streaming, the LLM emits `FunctionCall` chunks. You can gather these calls and run them all at once using the `executeToolCall` utility.
4. Override `llm_node`: If you're customizing your agent, override the `llm_node` method and include `parallelToolCalls: true` to enable these simultaneous calls.
For example, if a user asks, "What is the weather in San Francisco and Paris?", enabling `parallelToolCalls` triggers two `getWeather` calls running in parallel; the system collects both during streaming and executes them together. If you'd like a more detailed code example, just let me know; a sketch of the corresponding option in the Python plugins follows below.
Sources: agents/src/llm/provider_format/openai.test.ts | livekit-agents/livekit/agents/voice/agent_activity.py | Fireworks AI LLM integration guide | LiveKit Docs | Pipeline nodes and hooks | LiveKit Docs
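In the Python plugins the same idea is exposed as a constructor option rather than `parallelToolCalls`. A hedged sketch, assuming the OpenAI-compatible LLM plugin accepts `parallel_tool_calls` as described in the integration guides cited above (the model name is a placeholder):
```python
from livekit.agents import AgentSession
from livekit.plugins import openai

# Allow the model to emit several tool calls in a single turn; the agent
# framework then executes the registered tools concurrently.
session = AgentSession(
    llm=openai.LLM(
        model="gpt-4o-mini",
        parallel_tool_calls=True,
    ),
    # stt=..., tts=..., vad=... as in the earlier entrypoint sketch
)
```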
n
Rather than tool calling, can I have two LLM calls with the same input that do different things, and have both results get returned? Or do I need to use parallel tool calling for this?
t
Thanks for reaching out! Just to make sure I understand correctly, are you looking to handle the two LLM calls within a single pipeline—so both calls are processed and returned together—or would you prefer to manage them as separate processes? Also, it would be helpful to know if these calls are completely independent or if they share any dependencies or context. I'm here to help you figure this out!
n
Both calls are processed and returned as soon as they are ready.
t
Handling multiple LLM calls with the same input but for different purposes can feel a bit tricky, but LiveKit makes it quite manageable with its parallel execution features:
1. Initiate multiple LLM calls: With the `parallel_tool_calls` parameter, independent tasks run concurrently, so both calls happen at the same time and their results come back as soon as they're ready.
2. Stream results: LiveKit streams results incrementally, so you don't have to wait for everything to finish before you start seeing output; each call's output is available as soon as it's done.
3. Custom LLM node: For more control without tool calling, you can override the `llm_node` to manage multiple calls explicitly within your pipeline and tailor the logic however you need.
If you'd like, I can provide a detailed example of how to set up the `llm_node` for this; a rough sketch follows below.
Sources: Groq LLM integration guide | LiveKit Docs | Anthropic Claude LLM integration guide | LiveKit Docs | Cerebras LLM integration guide | LiveKit Docs | tools/server/server.cpp | agents/src/llm/llm.ts
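Here is a rough sketch of the `llm_node` override approach, not taken from the sources above: the primary response streams back as usual, while a second, independent LLM call on the same chat context runs as an `asyncio` task and is published over the room's data channel the moment it finishes. It assumes the 1.x Python SDK (`Agent.default.llm_node`, `get_job_context`, `publish_data`); the side prompt, topic name, and `side_llm` choice are placeholders:
```python
import asyncio

from livekit.agents import Agent, ModelSettings, get_job_context
from livekit.agents.llm import ChatContext
from livekit.plugins import openai

# Placeholder secondary model; this could be any LLM plugin instance
side_llm = openai.LLM(model="gpt-4o-mini")


class MultiResultAgent(Agent):
    async def llm_node(self, chat_ctx: ChatContext, tools, model_settings: ModelSettings):
        # Fire off the independent side call on a copy of the same input.
        # Keep a reference so the task isn't garbage-collected mid-flight.
        self._side_task = asyncio.create_task(self._run_side_call(chat_ctx.copy()))

        # Stream the primary response to the user as usual.
        async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
            yield chunk

    async def _run_side_call(self, chat_ctx: ChatContext) -> None:
        # Same input, different instruction (placeholder prompt)
        chat_ctx.add_message(role="system", content="Summarize the user's last request in one line.")

        parts: list[str] = []
        stream = side_llm.chat(chat_ctx=chat_ctx)
        async for chunk in stream:
            if chunk.delta and chunk.delta.content:
                parts.append(chunk.delta.content)

        # Deliver the side result as soon as it is ready, e.g. over the data channel.
        room = get_job_context().room
        await room.local_participant.publish_data("".join(parts), topic="side-result")
```
The same pattern extends to more than two calls: launch one task per side call, and each publishes (or otherwise returns) its result independently as soon as it completes.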