# ask-ai
n
In LiveKit, I have a multimodal app where I send a screenshot + some text. I need to return text/audio, but I also have a couple of other async LLM call results I need to return based on the message sent in. How would this work?
t
Thanks for reaching out, rohit. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
I'd love to understand a bit more about how you're currently managing the multimodal input (screenshot and text) in your app. For instance:
• Are you using LiveKit's APIs for this, or have you set up this functionality outside of LiveKit?
• How do you plan to process and return the results from the asynchronous LLM calls? Are these results connected to specific LiveKit features?
Getting a clearer picture will really help me offer you the best guidance possible.
n
Currently they happen on a Rust backend, but I'm looking to migrate them into LiveKit. Where would I implement it?
t
I'm glad you're interested in migrating your multimodal input processing and asynchronous LLM calls into LiveKit! A great way to do this is with the LiveKit Agents framework, an open-source framework designed for building real-time, multimodal AI applications. It offers backend SDKs in Python and Node.js (the latter still in development) that let you create programmable participants capable of handling audio, video, and text streams in real time. Here's how you can get started:
• Multimodal input: Use the stream-processing hooks in the Agent SDKs to handle audio, text, and video inputs. For instance, the `pushAudio` method lets you send resampled audio frames to your model, while text inputs can be processed directly.
• Asynchronous processing: The framework supports fully asynchronous pipelines built on Python `asyncio`. You can process media streams, generate outputs such as transcription or TTS, and handle events like `response_audio_delta` or `response_text_done` for low-latency updates.
• Custom logic: You can override built-in processing nodes (such as the transcription or TTS nodes) to tailor the backend logic, and you can manage room state and interact with participants through the `JobContext`.
For example, you might create an agent that processes both audio and text inputs, generates spoken responses, and updates its configuration dynamically via RPC. This setup lets you handle multimodal input and return results asynchronously within LiveKit. If you'd like more details or examples, just ask; a minimal worker sketch is included below.
Sources: An open source stack for real-time multimodal AI | plugins/google/src/beta/realtime/realtime_api.ts | livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/realtime/real... | agent/main.py | tools/server/server.cpp
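For orientation, here is a minimal sketch of what an Agents worker entrypoint typically looks like in the Python SDK. The provider plugins (Deepgram, OpenAI, Silero) and model names are placeholder choices rather than requirements, and exact imports can vary between SDK versions:
```python
from livekit import agents
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
    # Connect the agent (a programmable participant) to the room
    await ctx.connect()

    # Wire up the voice pipeline: STT -> LLM -> TTS, with VAD for turn detection
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=openai.TTS(),
        vad=silero.VAD.load(),
    )

    await session.start(
        agent=Agent(instructions="You are a helpful multimodal assistant."),
        room=ctx.room,
    )


if __name__ == "__main__":
    agents.cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
From here, incoming text or image payloads and any extra LLM calls can be handled in the same process as the audio pipeline.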
n
Do you have an example / recipe that shows multiple returns happening? So we are doing STT + returning a couple of results async from other LLM calls.
t
I'm glad you're exploring how to handle STT plus multiple asynchronous LLM call results in LiveKit! A good way to approach this is with the LiveKit Agents framework, which is built around asynchronous pipelines and parallel tool execution. There isn't a single example that combines all of these elements, but here is a practical recipe to get started:
1. Define tools: Create function tools the LLM can call, such as a weather lookup or an alarm-setting function. These tools run asynchronously.
2. Configure the AgentSession: Set up your session with STT, LLM, and TTS providers. The framework takes care of streaming audio to STT, processing the final transcripts, and forwarding them to the LLM.
3. Enable parallel tool calls: The LLM node supports running multiple tool calls in parallel; you define the tools and the framework manages their concurrent execution.
4. Customize if needed: You can override default nodes such as `llm_node` to add custom logic for handling tool calls or responses.
The Multi-agent Handoff example shows how STT, LLM, and TTS work together, and the Function Calling Test Plugin is a useful reference for executing multiple parallel tool calls with an LLM. If you'd like, I can share specific code snippets or walk you through setting up a similar pipeline; a sketch of step 1 follows below.
Sources: plugins/test/src/llm.ts | README.md
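To make step 1 concrete, here is a hedged sketch of two function tools defined on a custom agent, assuming the 1.x Python SDK's `function_tool` decorator. The tool names (`get_weather`, `set_alarm`) and their bodies are placeholders, not part of the framework:
```python
from livekit.agents import Agent, function_tool


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You can check weather and set alarms.")

    @function_tool()
    async def get_weather(self, location: str) -> str:
        """Look up the current weather for a location."""
        # Placeholder: call your real weather service here
        return f"It is sunny in {location}."

    @function_tool()
    async def set_alarm(self, time: str) -> str:
        """Set an alarm for the given time."""
        # Placeholder: call your real scheduling backend here
        return f"Alarm set for {time}."
```
The docstrings are what the LLM sees as tool descriptions, so they should describe when each tool applies.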
n
Is there a way to do this without function calling?
t
I'm happy to guide you through handling STT and asynchronous LLM results in LiveKit without using function calling. You can set up the agent pipeline to process audio and text directly:
1. STT processing: Use the `stt_node` to convert audio streams into text. You can customize this node, for example by adding noise filtering before processing or adjusting the text output afterward. The default `stt_node` handles streaming audio and includes Voice Activity Detection (VAD) for non-streaming STT engines.
2. LLM processing: Use the `llm_node` to generate text responses. To disable function calling, set `tool_choice = "none"` in the `ModelSettings`. The LLM then focuses solely on generating text without trying to call any external tools.
3. Integration workflow:
   ◦ Stream your audio into the `stt_node` and gather the transcriptions.
   ◦ Send the final transcript to the `llm_node` as a user message wrapped in a `ChatContext`.
   ◦ Take the LLM's text output and use it for whatever comes next, such as text-to-speech (TTS) or other processing.
Here's a simple example showing how to disable function calls in the `llm_node`:
```python
# Inside your custom Agent subclass:
async def llm_node(self, chat_ctx, tools, model_settings):
    # Disable tool/function calling for this turn
    model_settings.tool_choice = "none"
    # Delegate to the default implementation and stream its output through
    async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
        yield chunk
```
This approach keeps your STT and LLM interactions purely text-based, so no function calls are triggered. If you'd like help with any part of the setup, just let me know; a rough sketch of the transcript-to-LLM step follows below.
Sources: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py | Pipeline nodes and hooks | LiveKit Docs
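To illustrate the middle step, here is a rough sketch of running a finished transcript through an LLM plugin with a `ChatContext`, outside the voice pipeline. It assumes the 1.x Python API (`ChatContext.add_message`, `LLM.chat`, chunk `delta.content`); exact names can differ between versions, and the helper `run_llm_on_transcript` is hypothetical:
```python
from livekit.agents.llm import LLM, ChatContext


async def run_llm_on_transcript(llm: LLM, transcript: str) -> str:
    # Wrap the final STT transcript as a user message
    chat_ctx = ChatContext()
    chat_ctx.add_message(role="user", content=transcript)

    # Stream the completion and collect the text deltas
    parts: list[str] = []
    stream = llm.chat(chat_ctx=chat_ctx)
    async for chunk in stream:
        if chunk.delta and chunk.delta.content:
            parts.append(chunk.delta.content)
    return "".join(parts)
```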
n
Can I do multiple LLM calls in `llm_node`?
t
Yes, within LiveKit the `llm_node` can issue multiple tool calls at once by enabling parallel execution. Here's how to make this happen:
1. Enable parallel tool calls: Set `parallelToolCalls: true` in your LLM configuration. This lets the LLM call several tools at the same time based on what the user asks.
2. Define tools: Use a `ToolContext` to register the functions you want to call, with each tool's parameters defined with Zod for type safety. For instance, you could have tools like `getWeather` and `setAlarm`.
3. Stream responses: While streaming, the LLM emits `FunctionCall` chunks. You can gather these calls and run them all at once using the `executeToolCall` utility.
4. Override `llm_node`: If you're customizing your agent, override the `llm_node` method and include `parallelToolCalls: true` to enable these simultaneous calls.
For example, if a user asks, "What is the weather in San Francisco and Paris?", enabling `parallelToolCalls` triggers two `getWeather` calls running in parallel; the system collects both during streaming and executes them together. If you'd like a more detailed code example, just let me know; a sketch of the corresponding option in the Python plugins follows below.
Sources: agents/src/llm/provider_format/openai.test.ts | livekit-agents/livekit/agents/voice/agent_activity.py | Fireworks AI LLM integration guide | LiveKit Docs | Pipeline nodes and hooks | LiveKit Docs
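In the Python plugins the same idea is exposed as a constructor option rather than `parallelToolCalls`. A hedged sketch, assuming the OpenAI-compatible LLM plugin accepts `parallel_tool_calls` as described in the integration guides cited above (the model name is a placeholder):
```python
from livekit.agents import AgentSession
from livekit.plugins import openai

# Allow the model to emit several tool calls in a single turn; the agent
# framework then executes the registered tools concurrently.
session = AgentSession(
    llm=openai.LLM(
        model="gpt-4o-mini",
        parallel_tool_calls=True,
    ),
    # stt=..., tts=..., vad=... as in the earlier entrypoint sketch
)
```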
n
Rather than tool calling, can I have two LLM calls with the same input that do different things, and have both results get returned? Or do I need to use parallel tool calling for this?
t
Thanks for reaching out! Just to make sure I understand correctly, are you looking to handle the two LLM calls within a single pipeline—so both calls are processed and returned together—or would you prefer to manage them as separate processes? Also, it would be helpful to know if these calls are completely independent or if they share any dependencies or context. I'm here to help you figure this out!
n
Both calls are processed and returned as soon as they are ready.
t
Handling multiple LLM calls with the same input but for different purposes can feel a bit tricky, but LiveKit makes it quite manageable with its parallel execution features:
1. Initiate multiple LLM calls: With the `parallel_tool_calls` parameter, independent tasks run concurrently, so both calls happen at the same time and their results come back as soon as they're ready.
2. Stream results: LiveKit streams results incrementally, so you don't have to wait for everything to finish before you start seeing output; each call's output is available as soon as it's done.
3. Custom LLM node: For more control without tool calling, you can override the `llm_node` to manage multiple calls explicitly within your pipeline and tailor the logic however you need.
If you'd like, I can provide a detailed example of how to set up the `llm_node` for this; a rough sketch follows below.
Sources: Groq LLM integration guide | LiveKit Docs | Anthropic Claude LLM integration guide | LiveKit Docs | Cerebras LLM integration guide | LiveKit Docs | tools/server/server.cpp | agents/src/llm/llm.ts
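Here is a rough sketch of the `llm_node` override approach, not taken from the sources above: the primary response streams back as usual, while a second, independent LLM call on the same chat context runs as an `asyncio` task and is published over the room's data channel the moment it finishes. It assumes the 1.x Python SDK (`Agent.default.llm_node`, `get_job_context`, `publish_data`); the side prompt, topic name, and `side_llm` choice are placeholders:
```python
import asyncio

from livekit.agents import Agent, ModelSettings, get_job_context
from livekit.agents.llm import ChatContext
from livekit.plugins import openai

# Placeholder secondary model; this could be any LLM plugin instance
side_llm = openai.LLM(model="gpt-4o-mini")


class MultiResultAgent(Agent):
    async def llm_node(self, chat_ctx: ChatContext, tools, model_settings: ModelSettings):
        # Fire off the independent side call on a copy of the same input.
        # Keep a reference so the task isn't garbage-collected mid-flight.
        self._side_task = asyncio.create_task(self._run_side_call(chat_ctx.copy()))

        # Stream the primary response to the user as usual.
        async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
            yield chunk

    async def _run_side_call(self, chat_ctx: ChatContext) -> None:
        # Same input, different instruction (placeholder prompt)
        chat_ctx.add_message(role="system", content="Summarize the user's last request in one line.")

        parts: list[str] = []
        stream = side_llm.chat(chat_ctx=chat_ctx)
        async for chunk in stream:
            if chunk.delta and chunk.delta.content:
                parts.append(chunk.delta.content)

        # Deliver the side result as soon as it is ready, e.g. over the data channel.
        room = get_job_context().room
        await room.local_participant.publish_data("".join(parts), topic="side-result")
```
The same pattern extends to more than two calls: launch one task per side call, and each publishes (or otherwise returns) its result independently as soon as it completes.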