red-agent-48266
09/11/2025, 4:30 PM
tall-airline-16756
09/11/2025, 4:30 PM
tall-airline-16756
09/11/2025, 4:31 PM
pip install "livekit-agents[google]~=1.2"
For authentication, you'll just need to set the appropriate environment variables depending on whether you're using Vertex AI or the Google Gemini API. For the Gemini API, that means setting GOOGLE_API_KEY (for example, export GOOGLE_API_KEY=<your-key> in your shell or a .env file); for Vertex AI, you'd use your standard Google Cloud credentials (GOOGLE_APPLICATION_CREDENTIALS) instead.
When it comes to implementation, you can plug the RealtimeModel class into an AgentSession for a straightforward setup, or use the MultimodalAgent class if you're looking for more advanced control. Here's a simple example to guide you:
from livekit.agents import AgentSession
from livekit.plugins import google

# A realtime Gemini model plugged into an AgentSession
session = AgentSession(
    llm=google.beta.realtime.RealtimeModel(
        model="gemini-2.0-flash-exp",
        voice="Puck",
        temperature=0.8,
        instructions="You are a helpful assistant",
    ),
)
If you'd like to dive deeper, feel free to check out the Google Gemini integration guide or the Gemini Live API documentation. And of course, if you have any questions or need a hand along the way, I'm here to help—just let me know!
Sources: Gemini Live API integration guide | LiveKit Docs | Google Gemini integration guide | LiveKit Docs
red-agent-48266
09/11/2025, 4:32 PM
tall-airline-16756
09/11/2025, 4:32 PM
You can build a multimodal agent in TypeScript with the @livekit/agents SDK. This lets you create AI agents that can process and generate both audio and text in real time, offering a really dynamic experience.
Here’s a simple example of how you can define a multimodal agent in TypeScript:
import { JobContext, WorkerOptions, cli, defineAgent, multimodal } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';
import { JobType } from '@livekit/protocol';

export default defineAgent({
  entry: async (ctx: JobContext) => {
    // Connect to the room this job was dispatched for.
    await ctx.connect();

    const agent = new multimodal.MultimodalAgent({
      model: new openai.realtime.RealtimeModel({
        instructions: 'You are a helpful assistant.',
        voice: 'shimmer',
        modalities: ['text', 'audio'],
      }),
    });

    await agent.start(ctx.room);
  },
});

cli.runApp(new WorkerOptions({ agent: import.meta.url, workerType: JobType.JT_ROOM }));
This example uses the @livekit/agents npm package, which works seamlessly with TypeScript. You can easily customize the agent to handle text, audio, or both by adjusting the modalities parameter.
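For instance, a text-only agent is just a matter of changing that array. Here's a minimal sketch, assuming your version of the OpenAI plugin accepts ['text'] as the only modality:
import * as openai from '@livekit/agents-plugin-openai';

// Text-only variant of the model configuration above
// (assumes ['text'] on its own is accepted by the plugin).
const textOnlyModel = new openai.realtime.RealtimeModel({
  instructions: 'You are a helpful assistant.',
  modalities: ['text'],
});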
If you have any questions or would like me to walk you through more examples, just let me know—I'm here to help!
Sources: AI voice agents | LiveKit Docs | .eslintrc | web/src/components/code-viewer.tsx
tall-airline-16756
09/11/2025, 4:34 PM
red-agent-48266
09/11/2025, 4:35 PM
tall-airline-16756
09/11/2025, 4:37 PM
red-agent-48266
09/11/2025, 4:38 PM
tall-airline-16756
09/11/2025, 4:38 PM
You can use the @livekit/agents-plugin-google package, which helps you integrate Google's Gemini API for real-time audio and text processing.
Here’s a simple example to help you get started:
import * as google from '@livekit/agents-plugin-google';
import { multimodal } from '@livekit/agents';

const agent = new multimodal.MultimodalAgent({
  model: new google.beta.realtime.RealtimeModel({
    model: 'gemini-2.0-flash-exp',
    voice: 'Puck',
    modalities: ['AUDIO', 'TEXT'],
    instructions: 'You are a helpful assistant.',
    apiKey: process.env.GOOGLE_API_KEY,
  }),
});

// `room` is the LiveKit Room you are already connected to (e.g. ctx.room inside an agent's entry function).
agent.start(room);
Here’s a friendly step-by-step to guide you through the setup:
1. First, install the necessary packages by running: npm install @livekit/agents @livekit/agents-plugin-google.
2. Next, make sure to set up your authentication by configuring the GOOGLE_API_KEY environment variable.
3. Then, use the RealtimeModel class to configure the Gemini model, specifying ['AUDIO', 'TEXT'] for multimodal support.
4. Finally, start the agent in a LiveKit room to enable real-time audio interactions (see the sketch right after these steps).
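To tie those steps together, here's a minimal sketch of a complete worker file. The model name, voice, modalities, and the google.beta.realtime.RealtimeModel usage come from the snippet above; the defineAgent/cli wiring follows the earlier OpenAI example in this thread, and the JobType import location is my assumption, so treat this as a starting point rather than the definitive setup:
import { JobContext, WorkerOptions, cli, defineAgent, multimodal } from '@livekit/agents';
import * as google from '@livekit/agents-plugin-google';
import { JobType } from '@livekit/protocol';

export default defineAgent({
  entry: async (ctx: JobContext) => {
    // Step 4: connect to the LiveKit room this job was dispatched for.
    await ctx.connect();

    // Step 3: configure the Gemini realtime model (values from the snippet above).
    const agent = new multimodal.MultimodalAgent({
      model: new google.beta.realtime.RealtimeModel({
        model: 'gemini-2.0-flash-exp',
        voice: 'Puck',
        modalities: ['AUDIO', 'TEXT'],
        instructions: 'You are a helpful assistant.',
        apiKey: process.env.GOOGLE_API_KEY, // Step 2: GOOGLE_API_KEY from the environment.
      }),
    });

    await agent.start(ctx.room);
  },
});

cli.runApp(new WorkerOptions({ agent: import.meta.url, workerType: JobType.JT_ROOM }));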
This setup will empower you to process audio input and generate real-time audio and text responses using Gemini’s multimodal capabilities. If you have any questions or want me to walk you through more examples, just let me know—I’m here to help!
red-agent-48266
09/13/2025, 12:09 PM
tall-airline-16756
09/13/2025, 12:09 PM
To work with a RealtimeSession in LiveKit TypeScript, here’s a simple guide to walk you through the process:
1. Initialize the RealtimeModel: Begin by creating an instance of RealtimeModel with your preferred settings, such as instructions, voice options, or any model-specific configurations you want to use.
2. Set up the Agent: Next, use multimodal.MultimodalAgent with your initialized model. You can also add an optional tool context if you want to enable function calling.
3. Start the Session: Kick off the session by calling agent.start(room, participant), and then type-cast the result to openai.realtime.RealtimeSession like this:
const session = await agent
  .start(ctx.room, participant)
  .then((session) => session as openai.realtime.RealtimeSession);
4. Interact with the Session (a short sketch follows these steps):
◦ Create your initial message using session.conversation.item.create() with a ChatMessage.
◦ Then, prompt the agent to respond by calling session.response.create().
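Here's what step 4 can look like in code, continuing from the session variable above. It's a minimal sketch: the llm.ChatMessage.create() helper and the ChatRole.USER value are assumptions based on the @livekit/agents llm module, and the greeting text is just a placeholder:
import { llm } from '@livekit/agents';

// Seed the conversation with an initial user message (placeholder text).
session.conversation.item.create(
  llm.ChatMessage.create({
    role: llm.ChatRole.USER,
    text: 'Hi there, can you introduce yourself?',
  }),
);

// Ask the model to generate a response to the conversation so far.
session.response.create();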
The RealtimeSession class is pretty powerful: it supports bidirectional streaming, so you can have real-time voice and text interactions. It also handles function calls and emits events like generation_created, input_speech_started, and input_audio_transcription_completed, which makes managing real-time responses and user inputs smooth and seamless.
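If you want to react to those events yourself, you can subscribe to them on the session. This is only a sketch under the assumption that the session exposes an EventEmitter-style on() method and that the event names match the ones listed above; check your SDK version for the exact names and payloads:
// Hypothetical event handlers; the names are taken from the list above and
// the on() registration is assumed, so verify both against your SDK version.
session.on('input_speech_started', () => {
  console.log('User started speaking');
});

session.on('input_audio_transcription_completed', (event: unknown) => {
  console.log('Transcription completed:', event);
});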
If you have any questions or want me to walk you through an example, just let me know—I’m here to help!
Sources: plugins/google/src/beta/realtime/realtime_api.ts | Agents Core Architecture | livekit/livekit_composite | DeepWiki | src/agent.ts | packages/sip--um-actually/src/agent.ts | src/agent.ts