help with vision <@U084APK40NS>
# ask-ai
help with vision @millions-hair-24956
Could you clarify what specific aspect of vision you need help with? Are you looking to integrate vision capabilities into a LiveKit application, or do you have questions about using vision with LiveKit's APIs? For example, are you trying to process video frames, use vision in a multimodal agent, or something else? Please provide more details so I can assist you better.

Sources:
- https://livekit-users.slack.com/archives/C07FY8WHGPM/p1733148578.269969
- https://livekit-users.slack.com/archives/C07FYCB7FRQ/p1730227871.901309
@millions-hair-24956

```python
# Imports (Vision Related)
from livekit import rtc
from livekit.agents.llm import ChatMessage, ChatImage


# Helper Functions (get_video_track and get_latest_image)
async def get_video_track(room: rtc.Room):
    """Find and return the first available remote video track in the room."""
    for participant_id, participant in room.remote_participants.items():
        for track_id, track_publication in participant.track_publications.items():
            if track_publication.track and isinstance(
                track_publication.track, rtc.RemoteVideoTrack
            ):
                logger.info(
                    f"Found video track {track_publication.track.sid} "
                    f"from participant {participant_id}"
                )
                return track_publication.track
    logger.warning("No remote video track found in the room")  # Keep warning log for context
    return None


async def get_latest_image(room: rtc.Room):
    """Capture and return a single frame from the video track."""
    video_stream = None
    try:
        video_track = await get_video_track(room)
        if not video_track:  # Keep check for no video track
            return None
        video_stream = rtc.VideoStream(video_track)
        async for event in video_stream:
            logger.info("Captured latest video frame")
            return event.frame
    except Exception as e:
        logger.error(f"Failed to get latest image: {e}")
        return None
    finally:
        if video_stream:
            await video_stream.aclose()


# before_llm_cb function
async def before_llm_cb(agent, chat_ctx):  # Important: Show 'agent' argument
    """Callback that runs right before the LLM generates a response."""
    if not agent.enable_vision:  # Important: Show 'agent.enable_vision' check
        logger.debug("Vision is not enabled for agent, skipping vision processing.")
        return
    logger.info("Attempting to capture video frame for vision...")
    latest_image = await get_latest_image(ctx.room)
    if latest_image:
        logger.info("Successfully captured video frame")
        image_content = [ChatImage(image=latest_image)]
        chat_ctx.messages.append(ChatMessage(role="user", content=image_content))
        logger.info("Added visual content to conversation context")
    else:
        logger.warning("No video frame was captured")


# Agent Initialization in entrypoint (Crucial Part)
agent = InterruptibleSalesAgent(  # Show InterruptibleSalesAgent class
    persona=persona,
    room_name=ctx.room.name,
    vad=ctx.proc.userdata["vad"],
    stt=deepgram.STT(model="nova-3-general", api_key=deepgram_api_key),
    llm=google.LLM(
        model=google_model,  # Show LLM model being used
        api_key=google_api_key,
        temperature=0.7,
        max_output_tokens=100,
    ),
    tts=cartesia.TTS(
        model=tts_model,
        voice=tts_voice,
        emotion=tts_emotion,
        api_key=cartesia_api_key,
    ),
    turn_detector=turn_detector.EOUModel(),
    min_endpointing_delay=min_endpointing_delay,
    max_endpointing_delay=max_endpointing_delay,
    chat_ctx=initial_ctx,
    before_llm_cb=before_llm_cb  # VERY IMPORTANT: Show before_llm_cb is passed here
)

# Setting agent properties (including enable_vision)
agent.recording_url = recording_url
agent.egress_id = egress_id
agent.webhook_url = webhook_url
agent.webhook_secret = webhook_secret
agent.enable_vision = enable_vision  # Show enable_vision assignment
# ... (rest of agent property settings if any) ...

# Vision Enabling Log
if enable_vision:
    logger.info("Vision enabled: set up before_llm_callback")  # Show the log line
```
I'm working on a Python agent using the Agents Framework and trying to enable vision, but it doesn't seem to be working. I've set `enable_vision` to `True` in my job metadata, and I've implemented a `before_llm_cb` function as shown in the documentation to capture video frames and add them to the LLM context.

However, my agent isn't describing anything visual in its responses. I'm using the Google Gemini API (model: `gemini-2.0-flash`, but I can change this if needed) for the LLM, and I'm publishing video from my client to the LiveKit room.

Could you please help me diagnose why vision isn't functioning?

The code above is the relevant part of my `agent.py` file: it includes the imports, the helper functions for video, the `before_llm_cb` function, and how I initialize my `InterruptibleSalesAgent`, including passing `before_llm_cb` and setting `agent.enable_vision`.
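As a side note on the `enable_vision` flag mentioned above: here is a minimal sketch of how such a flag might be read from the job metadata inside `entrypoint`. It assumes the metadata is a JSON string; the `enable_vision` key is this application's own convention, not something LiveKit defines.

```python
import json

from livekit.agents import JobContext


async def entrypoint(ctx: JobContext):
    # ctx.job.metadata arrives as a plain string; parse it and pull out the
    # application-defined "enable_vision" flag (defaulting to False).
    metadata = json.loads(ctx.job.metadata or "{}")
    enable_vision = bool(metadata.get("enable_vision", False))

    # ... construct the agent as shown above, then mirror the flag onto it:
    # agent.enable_vision = enable_vision
```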
It seems likely that the issue might be related to how the video frames are being captured and processed before being added to the LLM context. Here are a few things to check and consider:

1. Video Track Availability: Ensure that the video track is indeed being published to the room and is accessible. You can verify this by checking the logs to see if the video track is found and if frames are being captured.
2. Frame Capture: The `get_latest_image` function should be capturing frames correctly. Make sure that the `rtc.VideoStream` is properly initialized and that frames are being iterated over. If no frames are captured, it might indicate an issue with the video stream setup.
3. LLM Context Update: Ensure that the `before_llm_cb` function is correctly appending the captured image to `chat_ctx.messages`. The `ChatImage` should be properly instantiated with the captured frame.
4. Enable Vision: Double-check that `agent.enable_vision` is set to `True` and that this setting is being respected in the `before_llm_cb` function.
5. Logging: Use logging to trace the flow of execution and verify that each step is being executed as expected. This includes checking if `before_llm_cb` is being called and if the image is being added to the context.

If you would like more detailed guidance, please let me know, and I can break down the steps further (a minimal sketch of this pattern follows below).

Sources:
- https://livekit-users.slack.com/archives/C07FY8WHGPM/p1733239980.044189
- https://livekit-users.slack.com/archives/C07FY8WHGPM/p1730233417.503479
@millions-hair-24956 I know you can have the agent be interrupted, but can I have the agent interrupt us, as the user?
Yes, an agent can be configured to interrupt a user. This can be useful in scenarios where the user is speaking for too long or if the agent needs to interject with important information. To achieve this, you can use a classifier to analyze the user's speech in real time and determine if an interruption is necessary based on predefined criteria.

For example, you can use the `VoicePipelineAgent` and set up a mechanism to monitor the user's speech. If the speech exceeds a certain length or matches specific patterns, the agent can interrupt by starting a new speech task. This involves using the `before_llm_cb` callback to modify the conversation context, or directly invoking the agent's `say` method with `allow_interruptions=True`.

Would you like more detailed guidance on implementing this feature?

Sources:
- https://docs.livekit.io/agents/voice-agent/voice-pipeline/
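For what the length-based variant could look like in practice, here is a rough sketch assuming the 0.x `VoicePipelineAgent` API: it uses the agent's `user_started_speaking` / `user_stopped_speaking` events and `say(..., allow_interruptions=True)`. The 15-second threshold and the interjection wording are arbitrary placeholders, and a classifier-based trigger would replace the fixed sleep with a check on the live transcription.

```python
import asyncio
from typing import Optional

from livekit.agents.pipeline import VoicePipelineAgent

MAX_USER_SPEECH_SECONDS = 15  # arbitrary threshold for this sketch


def setup_agent_interruptions(agent: VoicePipelineAgent) -> None:
    """Interject if the user has been speaking longer than the threshold."""
    interrupt_task: Optional[asyncio.Task] = None

    async def _interrupt_after_delay() -> None:
        # If the user is still talking once the threshold elapses, speak over them.
        await asyncio.sleep(MAX_USER_SPEECH_SECONDS)
        await agent.say(
            "Sorry to jump in, but let me make sure I understand so far.",
            allow_interruptions=True,  # the user can immediately take the floor back
        )

    @agent.on("user_started_speaking")
    def _on_user_started_speaking():
        nonlocal interrupt_task
        # Start the countdown each time the user begins a turn.
        interrupt_task = asyncio.create_task(_interrupt_after_delay())

    @agent.on("user_stopped_speaking")
    def _on_user_stopped_speaking():
        nonlocal interrupt_task
        # The user finished before the threshold, so cancel the pending interjection.
        if interrupt_task is not None:
            interrupt_task.cancel()
            interrupt_task = None
```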