# ask-ai
Thanks for reaching out, Anigilaje. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
To add vision capabilities to your LiveKit setup, you need to enable video input in the `RoomInputOptions`. Here's how you can modify your code:
1. Set `video_enabled=True` in the `RoomInputOptions` when starting the session. This enables video input for the room.
2. Optionally, configure noise cancellation using LiveKit's plugins like `noise_cancellation.BVC()`.
Here's an example of how to update your code:
```python
from livekit.agents import RoomInputOptions, RoomOutputOptions

# inside the agent entrypoint, where `ctx` is the JobContext and
# `session`/`agent` are your AgentSession and Agent
await session.start(
    room=ctx.room,
    agent=agent,
    room_input_options=RoomInputOptions(video_enabled=True, text_enabled=True, audio_enabled=True),
    room_output_options=RoomOutputOptions(transcription_enabled=True, audio_enabled=True),
)
```
If you need further assistance or have specific requirements for vision capabilities, let me know! Sources: Vision | LiveKit Docs | complex-agents/vision/agent.py | Web and mobile frontends | LiveKit Docs
Hey there! Enabling `video_enabled=True` in `RoomInputOptions` lets the agent receive video frames from the camera, which is a great start. But to actually have the agent respond to what it "sees" (like answering questions based on the camera input), you'll want to add a vision model, such as Gemini Live, into your setup. Just turning on video input isn't quite enough for the agent to understand and react to visual info. You'll need to connect it with a vision-capable model that can analyze those video frames and come up with responses based on the visuals. If you'd like, I'm here to help walk you through integrating a vision model, just give me a shout!
Hey there! Just a quick heads-up: the GPT-4.1-mini model you're using doesn’t support vision features for handling video input. If you want the agent to understand visual stuff, you’ll need to switch to a vision-capable model like Gemini Live or GPT-4-vision-preview. If you’d like a hand setting that up or want to explore some other options, just give me a shout—I’m here to help! Sources: Large language model (LLM) integrations | LiveKit Docs | tools/mtmd/legacy-models/minicpmv-convert-image-encoder-to-gguf.py | Vision | LiveKit Docs | livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/models.py | ChatModels | LiveKit Agents
I don't have the answer you're looking for. You could also try asking your question:
• in one of the other Slack channels, or
• to https://deepwiki.com/livekit/livekit_composite which is trained on all LiveKit source code
If you find the answer, please post it here to help others!