# ask-ai
r
https://docs.livekit.io/agents/v0/voice-agent/multimodal-agent/ Could I find the multimodal model for Gemini docs in LiveKit?
t
Thanks for reaching out, Hrithik. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
I'm glad you're interested in using LiveKit with Google's Gemini Live API! The integration brings Gemini's multimodal model into the LiveKit Agents framework, supporting low-latency, two-way interactions with text, audio, and video input and audio and text output, which makes it a good fit for natural-sounding voice AI applications. To get started with Gemini, install the Google plugin from PyPI:
```shell
pip install "livekit-agents[google]~=1.2"
```
For authentication, you'll need to set the appropriate environment variables depending on whether you're using Vertex AI or the Google Gemini API; for instance, `GOOGLE_API_KEY` is required for the Gemini API. For implementation, you can use the `RealtimeModel` class for a straightforward setup, or the `MultimodalAgent` class if you're looking for more advanced control. Here's a simple example to guide you:
```python
from livekit.agents import AgentSession
from livekit.plugins import google

# Configure an AgentSession backed by the Gemini Live realtime model
session = AgentSession(
    llm=google.beta.realtime.RealtimeModel(
        model="gemini-2.0-flash-exp",
        voice="Puck",
        temperature=0.8,
        instructions="You are a helpful assistant",
    ),
)
```
If you'd like to dive deeper, feel free to check out the Google Gemini integration guide or the Gemini Live API documentation. And of course, if you have any questions or need a hand along the way, I'm here to help—just let me know! Sources: Gemini Live API integration guide | LiveKit Docs | Google Gemini integration guide | LiveKit Docs
r
I am interested in TypeScript.
t
I'm excited to share that LiveKit supports building agents with multimodal models in TypeScript using the `@livekit/agents` SDK. This lets you create AI agents that can process and generate both audio and text in real time, offering a really dynamic experience. Here's a simple example of how you can define a multimodal agent in TypeScript:
```typescript
import { JobContext, WorkerOptions, cli, defineAgent, multimodal } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';
import { JobType } from '@livekit/protocol';
import { fileURLToPath } from 'node:url';

export default defineAgent({
  entry: async (ctx: JobContext) => {
    await ctx.connect();

    // Wrap the OpenAI realtime model in a MultimodalAgent
    const agent = new multimodal.MultimodalAgent({
      model: new openai.realtime.RealtimeModel({
        instructions: "You are a helpful assistant.",
        voice: 'shimmer',
        modalities: ['text', 'audio'],
      }),
    });

    // Join the room from the job context and start the agent
    await agent.start(ctx.room);
  },
});

cli.runApp(new WorkerOptions({ agent: fileURLToPath(import.meta.url), workerType: JobType.JT_ROOM }));
```
This example uses the `@livekit/agents` npm package, which works seamlessly with TypeScript. You can easily customize the agent to handle text, audio, or both by adjusting the `modalities` parameter. If you have any questions or would like me to walk you through more examples, just let me know; I'm here to help! Sources: AI voice agents | LiveKit Docs | .eslintrc | web/src/components/code-viewer.tsx
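As a quick aside on the `modalities` note above, here's a minimal sketch of that customization, reusing the `RealtimeModel` options from the example; the option values are illustrative, not prescriptive:

```typescript
import * as openai from '@livekit/agents-plugin-openai';

// Text-only variant: omit 'audio' from `modalities` so the model responds with text only.
const textOnlyModel = new openai.realtime.RealtimeModel({
  instructions: 'You are a helpful assistant.',
  modalities: ['text'],
});
```

Swap this model into the `MultimodalAgent` above if you only want text responses from the model.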
r
Does it work with TypeScript and Gemini for multimodal models?
t
Thanks for reaching out! Just to make sure I understand you correctly, are you looking to integrate Google's Gemini API specifically with TypeScript for multimodal models in LiveKit? If that's the case, could you share a bit more about what you're hoping to achieve—like real-time audio and text processing, or maybe a particular use case you have in mind? I'm here to help you through this!
r
I want to make a real-time audio platform with TypeScript, Gemini, and a multimodal model.
t
I'm glad you're exploring how to create a real-time audio platform using TypeScript, Gemini, and a multimodal model in LiveKit! You can make this happen by using the `@livekit/agents-plugin-google` package, which helps you integrate Google's Gemini API for real-time audio and text processing. Here's a simple example to help you get started:
```typescript
import { JobContext, WorkerOptions, cli, defineAgent, multimodal } from '@livekit/agents';
import * as google from '@livekit/agents-plugin-google';
import { fileURLToPath } from 'node:url';

export default defineAgent({
  entry: async (ctx: JobContext) => {
    await ctx.connect();

    // Wrap the Gemini Live realtime model in a MultimodalAgent
    const agent = new multimodal.MultimodalAgent({
      model: new google.beta.realtime.RealtimeModel({
        model: 'gemini-2.0-flash-exp',
        voice: 'Puck',
        modalities: ['AUDIO', 'TEXT'],
        instructions: 'You are a helpful assistant.',
        apiKey: process.env.GOOGLE_API_KEY,
      }),
    });

    // Join the room from the job context and start streaming
    await agent.start(ctx.room);
  },
});

cli.runApp(new WorkerOptions({ agent: fileURLToPath(import.meta.url) }));
```
Here's a friendly step-by-step to guide you through the setup:

1. Install the necessary packages: `npm install @livekit/agents @livekit/agents-plugin-google`.
2. Set up authentication by configuring the `GOOGLE_API_KEY` environment variable (there's a small sketch of this after these steps).
3. Use the `RealtimeModel` class to configure the Gemini model, specifying `['AUDIO', 'TEXT']` for multimodal support.
4. Start the agent in a LiveKit room to enable real-time audio interactions.

This setup lets you process audio input and generate real-time audio and text responses using Gemini's multimodal capabilities. If you have any questions or want me to walk you through more examples, just let me know; I'm here to help!
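If it helps, here's a small sketch of step 2. Loading the key from an `.env.local` file with the `dotenv` package is just one common Node.js convention, not something required by LiveKit, so treat the file name and package choice as assumptions:

```typescript
// Load environment variables before the worker starts.
// Assumption: a local .env.local file holding GOOGLE_API_KEY (any dotenv-style setup works).
import { config } from 'dotenv';

config({ path: '.env.local' });

// Fail fast if the Gemini API key is missing.
if (!process.env.GOOGLE_API_KEY) {
  throw new Error('GOOGLE_API_KEY must be set to use the Gemini Live API');
}
```

Run this before `cli.runApp(...)` so the key is available when the `RealtimeModel` is constructed.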
r
How do I use the session? `const session = await agent.start(ctx.room, participant).then((session) => session as openai.realtime.RealtimeSession);`
t
To get started with a session using `RealtimeSession` in LiveKit TypeScript, here's a simple guide to walk you through the process:

1. Initialize the RealtimeModel: Create an instance of `RealtimeModel` with your preferred settings, such as instructions, voice options, or any model-specific configuration.
2. Set up the agent: Use `multimodal.MultimodalAgent` with your initialized model. You can also add an optional tool context if you want to enable function calling.
3. Start the session: Call `agent.start(room, participant)`, then type-cast the result to `openai.realtime.RealtimeSession`:

```typescript
const session = await agent
  .start(ctx.room, participant)
  .then((session) => session as openai.realtime.RealtimeSession);
```

4. Interact with the session:
   - Create your initial message using `session.conversation.item.create()` with a `ChatMessage`.
   - Prompt the agent to respond by calling `session.response.create()`.

The `RealtimeSession` class is pretty powerful: it supports bidirectional streaming, so you can have real-time voice and text interactions. It also handles function calls and emits events like `generation_created`, `input_speech_started`, and `input_audio_transcription_completed`, which makes managing real-time responses and user inputs smooth and seamless. A sketch of step 4 follows below. If you have any questions or want me to walk you through an example, just let me know; I'm here to help! Sources: plugins/google/src/beta/realtime/realtime_api.ts | Agents Core Architecture | livekit/livekit_composite | DeepWiki | src/agent.ts | packages/sip--um-actually/src/agent.ts | src/agent.ts
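To make step 4 concrete, here's a rough sketch of seeding the conversation and requesting a response. It assumes the `llm` helpers exported by `@livekit/agents` and the session obtained in step 3, so double-check the exact names against your installed SDK version:

```typescript
import { llm } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';

// `agent`, `ctx`, and `participant` come from the MultimodalAgent setup shown earlier.
const session = await agent
  .start(ctx.room, participant)
  .then((s) => s as openai.realtime.RealtimeSession);

// Step 4a: create an initial assistant message in the conversation.
session.conversation.item.create(
  llm.ChatMessage.create({
    role: llm.ChatRole.ASSISTANT,
    text: 'Greet the user and ask how you can help.',
  }),
);

// Step 4b: ask the model to generate a reply to it.
session.response.create();
```

From there you can listen for the session events mentioned above (for example `input_speech_started`) to drive your application logic.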