# ask-ai
r
https://docs.livekit.io/agents/v0/voice-agent/multimodal-agent/ Could I find the multimodal model for Gemini docs in LiveKit?
t
Thanks for reaching out, Hrithik. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
I'm glad you're interested in using LiveKit with Google's Gemini Live API! The integration brings Gemini's multimodal model into the LiveKit Agents framework, supporting low-latency, two-way interactions with text, audio, and video input and audio and text output, which makes it a good fit for natural-sounding voice AI applications. To get started with Gemini, install the Google plugin from PyPI:
```shell
pip install "livekit-agents[google]~=1.2"
```
For authentication, you'll need to set the appropriate environment variables depending on whether you're using Vertex AI or the Google Gemini API; for instance, `GOOGLE_API_KEY` is required for the Gemini API. For implementation, you can use the `RealtimeModel` class for a straightforward setup, or the `MultimodalAgent` class if you're looking for more advanced control. Here's a simple example to guide you:
```python
from livekit.agents import AgentSession
from livekit.plugins import google

# Configure an AgentSession backed by the Gemini Live realtime model
session = AgentSession(
    llm=google.beta.realtime.RealtimeModel(
        model="gemini-2.0-flash-exp",
        voice="Puck",
        temperature=0.8,
        instructions="You are a helpful assistant",
    ),
)
```
If you'd like to dive deeper, feel free to check out the Google Gemini integration guide or the Gemini Live API documentation. And of course, if you have any questions or need a hand along the way, I'm here to help—just let me know! Sources: Gemini Live API integration guide | LiveKit Docs | Google Gemini integration guide | LiveKit Docs
r
I am interested in TypeScript.
t
I'm excited to share that LiveKit supports building agents with multimodal models in TypeScript using the `@livekit/agents` SDK. This lets you create AI agents that can process and generate both audio and text in real time, offering a really dynamic experience. Here's a simple example of how you can define a multimodal agent in TypeScript:
```typescript
import { JobContext, WorkerOptions, cli, defineAgent, multimodal } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';
import { JobType } from '@livekit/protocol';
import { fileURLToPath } from 'node:url';

export default defineAgent({
  entry: async (ctx: JobContext) => {
    await ctx.connect();

    // Wrap the OpenAI realtime model in a MultimodalAgent
    const agent = new multimodal.MultimodalAgent({
      model: new openai.realtime.RealtimeModel({
        instructions: "You are a helpful assistant.",
        voice: 'shimmer',
        modalities: ['text', 'audio'],
      }),
    });

    // Join the room from the job context and start the agent
    await agent.start(ctx.room);
  },
});

cli.runApp(new WorkerOptions({ agent: fileURLToPath(import.meta.url), workerType: JobType.JT_ROOM }));
```
This example uses the `@livekit/agents` npm package, which works seamlessly with TypeScript. You can easily customize the agent to handle text, audio, or both by adjusting the `modalities` parameter. If you have any questions or would like me to walk you through more examples, just let me know; I'm here to help! Sources: AI voice agents | LiveKit Docs | .eslintrc | web/src/components/code-viewer.tsx
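As a quick aside on the `modalities` note above, here's a minimal sketch of that customization, reusing the `RealtimeModel` options from the example; the option values are illustrative, not prescriptive:

```typescript
import * as openai from '@livekit/agents-plugin-openai';

// Text-only variant: omit 'audio' from `modalities` so the model responds with text only.
const textOnlyModel = new openai.realtime.RealtimeModel({
  instructions: 'You are a helpful assistant.',
  modalities: ['text'],
});
```

Swap this model into the `MultimodalAgent` above if you only want text responses from the model.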
r
Does it work with TypeScript and Gemini for multimodal models?
t
Thanks for reaching out! Just to make sure I understand you correctly, are you looking to integrate Google's Gemini API specifically with TypeScript for multimodal models in LiveKit? If that's the case, could you share a bit more about what you're hoping to achieve—like real-time audio and text processing, or maybe a particular use case you have in mind? I'm here to help you through this!
r
I want to make a real-time audio platform with TypeScript, Gemini, and a multimodal model.
t
I'm glad you're exploring how to create a real-time audio platform using TypeScript, Gemini, and a multimodal model in LiveKit! You can make this happen by using the `@livekit/agents-plugin-google` package, which helps you integrate Google's Gemini API for real-time audio and text processing. Here's a simple example to help you get started:
```typescript
import { JobContext, WorkerOptions, cli, defineAgent, multimodal } from '@livekit/agents';
import * as google from '@livekit/agents-plugin-google';
import { fileURLToPath } from 'node:url';

export default defineAgent({
  entry: async (ctx: JobContext) => {
    await ctx.connect();

    // Wrap the Gemini Live realtime model in a MultimodalAgent
    const agent = new multimodal.MultimodalAgent({
      model: new google.beta.realtime.RealtimeModel({
        model: 'gemini-2.0-flash-exp',
        voice: 'Puck',
        modalities: ['AUDIO', 'TEXT'],
        instructions: 'You are a helpful assistant.',
        apiKey: process.env.GOOGLE_API_KEY,
      }),
    });

    // Join the room from the job context and start streaming
    await agent.start(ctx.room);
  },
});

cli.runApp(new WorkerOptions({ agent: fileURLToPath(import.meta.url) }));
```
Here's a friendly step-by-step to guide you through the setup:

1. Install the necessary packages: `npm install @livekit/agents @livekit/agents-plugin-google`.
2. Set up authentication by configuring the `GOOGLE_API_KEY` environment variable (there's a small sketch of this after these steps).
3. Use the `RealtimeModel` class to configure the Gemini model, specifying `['AUDIO', 'TEXT']` for multimodal support.
4. Start the agent in a LiveKit room to enable real-time audio interactions.

This setup lets you process audio input and generate real-time audio and text responses using Gemini's multimodal capabilities. If you have any questions or want me to walk you through more examples, just let me know; I'm here to help!
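If it helps, here's a small sketch of step 2. Loading the key from an `.env.local` file with the `dotenv` package is just one common Node.js convention, not something required by LiveKit, so treat the file name and package choice as assumptions:

```typescript
// Load environment variables before the worker starts.
// Assumption: a local .env.local file holding GOOGLE_API_KEY (any dotenv-style setup works).
import { config } from 'dotenv';

config({ path: '.env.local' });

// Fail fast if the Gemini API key is missing.
if (!process.env.GOOGLE_API_KEY) {
  throw new Error('GOOGLE_API_KEY must be set to use the Gemini Live API');
}
```

Run this before `cli.runApp(...)` so the key is available when the `RealtimeModel` is constructed.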
r
How do I use the session? `const session = await agent.start(ctx.room, participant).then((session) => session as openai.realtime.RealtimeSession);`
t
To get started with a session using `RealtimeSession` in LiveKit TypeScript, here's a simple guide to walk you through the process:

1. Initialize the RealtimeModel: Create an instance of `RealtimeModel` with your preferred settings, such as instructions, voice options, or any model-specific configuration.
2. Set up the agent: Use `multimodal.MultimodalAgent` with your initialized model. You can also add an optional tool context if you want to enable function calling.
3. Start the session: Call `agent.start(room, participant)`, then type-cast the result to `openai.realtime.RealtimeSession`:

```typescript
const session = await agent
  .start(ctx.room, participant)
  .then((session) => session as openai.realtime.RealtimeSession);
```

4. Interact with the session:
   - Create your initial message using `session.conversation.item.create()` with a `ChatMessage`.
   - Prompt the agent to respond by calling `session.response.create()`.

The `RealtimeSession` class is pretty powerful: it supports bidirectional streaming, so you can have real-time voice and text interactions. It also handles function calls and emits events like `generation_created`, `input_speech_started`, and `input_audio_transcription_completed`, which makes managing real-time responses and user inputs smooth and seamless. A sketch of step 4 follows below. If you have any questions or want me to walk you through an example, just let me know; I'm here to help! Sources: plugins/google/src/beta/realtime/realtime_api.ts | Agents Core Architecture | livekit/livekit_composite | DeepWiki | src/agent.ts | packages/sip--um-actually/src/agent.ts | src/agent.ts
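To make step 4 concrete, here's a rough sketch of seeding the conversation and requesting a response. It assumes the `llm` helpers exported by `@livekit/agents` and the session obtained in step 3, so double-check the exact names against your installed SDK version:

```typescript
import { llm } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';

// `agent`, `ctx`, and `participant` come from the MultimodalAgent setup shown earlier.
const session = await agent
  .start(ctx.room, participant)
  .then((s) => s as openai.realtime.RealtimeSession);

// Step 4a: create an initial assistant message in the conversation.
session.conversation.item.create(
  llm.ChatMessage.create({
    role: llm.ChatRole.ASSISTANT,
    text: 'Greet the user and ask how you can help.',
  }),
);

// Step 4b: ask the model to generate a reply to it.
session.response.create();
```

From there you can listen for the session events mentioned above (for example `input_speech_started`) to drive your application logic.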