help with vision <@U084APK40NS>
# ask-ai
help with vision @millions-hair-24956
Could you clarify what specific aspect of vision you need help with? Are you looking to integrate vision capabilities into a LiveKit application, or do you have questions about using vision with LiveKit's APIs? For example, are you trying to process video frames, use vision in a multimodal agent, or something else? Please provide more details so I can assist you better.

Sources:
- https://livekit-users.slack.com/archives/C07FY8WHGPM/p1733148578.269969
- https://livekit-users.slack.com/archives/C07FYCB7FRQ/p1730227871.901309
@millions-hair-24956

```python
# Imports (Vision Related)
from livekit import rtc
from livekit.agents.llm import ChatMessage, ChatImage


# Helper Functions (get_video_track and get_latest_image)
async def get_video_track(room: rtc.Room):
    """Find and return the first available remote video track in the room."""
    for participant_id, participant in room.remote_participants.items():
        for track_id, track_publication in participant.track_publications.items():
            if track_publication.track and isinstance(
                track_publication.track, rtc.RemoteVideoTrack
            ):
                logger.info(
                    f"Found video track {track_publication.track.sid} "
                    f"from participant {participant_id}"
                )
                return track_publication.track
    logger.warning("No remote video track found in the room")  # Keep warning log for context
    return None


async def get_latest_image(room: rtc.Room):
    """Capture and return a single frame from the video track."""
    video_stream = None
    try:
        video_track = await get_video_track(room)
        if not video_track:  # Keep check for no video track
            return None
        video_stream = rtc.VideoStream(video_track)
        async for event in video_stream:
            logger.info("Captured latest video frame")
            return event.frame
    except Exception as e:
        logger.error(f"Failed to get latest image: {e}")
        return None
    finally:
        if video_stream:
            await video_stream.aclose()


# before_llm_cb function
async def before_llm_cb(agent, chat_ctx):  # Important: Show 'agent' argument
    """Callback that runs right before the LLM generates a response."""
    if not agent.enable_vision:  # Important: Show 'agent.enable_vision' check
        logger.debug("Vision is not enabled for agent, skipping vision processing.")
        return
    logger.info("Attempting to capture video frame for vision...")
    latest_image = await get_latest_image(ctx.room)
    if latest_image:
        logger.info("Successfully captured video frame")
        image_content = [ChatImage(image=latest_image)]
        chat_ctx.messages.append(ChatMessage(role="user", content=image_content))
        logger.info("Added visual content to conversation context")
    else:
        logger.warning("No video frame was captured")


# Agent Initialization in entrypoint (Crucial Part)
agent = InterruptibleSalesAgent(  # Show InterruptibleSalesAgent class
    persona=persona,
    room_name=ctx.room.name,
    vad=ctx.proc.userdata["vad"],
    stt=deepgram.STT(model="nova-3-general", api_key=deepgram_api_key),
    llm=google.LLM(
        model=google_model,  # Show LLM model being used
        api_key=google_api_key,
        temperature=0.7,
        max_output_tokens=100,
    ),
    tts=cartesia.TTS(
        model=tts_model,
        voice=tts_voice,
        emotion=tts_emotion,
        api_key=cartesia_api_key,
    ),
    turn_detector=turn_detector.EOUModel(),
    min_endpointing_delay=min_endpointing_delay,
    max_endpointing_delay=max_endpointing_delay,
    chat_ctx=initial_ctx,
    before_llm_cb=before_llm_cb  # VERY IMPORTANT: Show before_llm_cb is passed here
)

# Setting agent properties (including enable_vision)
agent.recording_url = recording_url
agent.egress_id = egress_id
agent.webhook_url = webhook_url
agent.webhook_secret = webhook_secret
agent.enable_vision = enable_vision  # Show enable_vision assignment
# ... (rest of agent property settings if any) ...

# Vision Enabling Log
if enable_vision:
    logger.info("Vision enabled: set up before_llm_callback")  # Show the log line
```
I'm working on a Python agent using the Agents Framework and trying to enable vision, but it doesn't seem to be working. I've set `enable_vision` to `True` in my job metadata, and I've implemented a `before_llm_cb` function as shown in the documentation to capture video frames and add them to the LLM context.

However, my agent isn't describing anything visual in its responses. I'm using the Google Gemini API (model: `gemini-2.0-flash`, but I can change this if needed) for the LLM, and I'm publishing video from my client to the LiveKit room.

Could you please help me diagnose why vision isn't functioning?

The code above is the relevant part of my `agent.py` file: it includes the imports, the helper functions for video, the `before_llm_cb` function, and how I initialize my `InterruptibleSalesAgent`, including passing `before_llm_cb` and setting `agent.enable_vision`.
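As a side note on the `enable_vision` flag mentioned above: here is a minimal sketch of how such a flag might be read from the job metadata inside `entrypoint`. It assumes the metadata is a JSON string; the `enable_vision` key is this application's own convention, not something LiveKit defines.

```python
import json

from livekit.agents import JobContext


async def entrypoint(ctx: JobContext):
    # ctx.job.metadata arrives as a plain string; parse it and pull out the
    # application-defined "enable_vision" flag (defaulting to False).
    metadata = json.loads(ctx.job.metadata or "{}")
    enable_vision = bool(metadata.get("enable_vision", False))

    # ... construct the agent as shown above, then mirror the flag onto it:
    # agent.enable_vision = enable_vision
```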
It seems likely that the issue might be related to how the video frames are being captured and processed before being added to the LLM context. Here are a few things to check and consider:

1. Video Track Availability: Ensure that the video track is indeed being published to the room and is accessible. You can verify this by checking the logs to see if the video track is found and if frames are being captured.
2. Frame Capture: The `get_latest_image` function should be capturing frames correctly. Make sure that the `rtc.VideoStream` is properly initialized and that frames are being iterated over. If no frames are captured, it might indicate an issue with the video stream setup.
3. LLM Context Update: Ensure that the `before_llm_cb` function is correctly appending the captured image to `chat_ctx.messages`. The `ChatImage` should be properly instantiated with the captured frame.
4. Enable Vision: Double-check that `agent.enable_vision` is set to `True` and that this setting is being respected in the `before_llm_cb` function.
5. Logging: Use logging to trace the flow of execution and verify that each step is being executed as expected. This includes checking if `before_llm_cb` is being called and if the image is being added to the context.

If you would like more detailed guidance, please let me know, and I can break down the steps further (a minimal sketch of this pattern follows below).

Sources:
- https://livekit-users.slack.com/archives/C07FY8WHGPM/p1733239980.044189
- https://livekit-users.slack.com/archives/C07FY8WHGPM/p1730233417.503479
@millions-hair-24956 I know you can have the agent be interrupted, but can I have the agent interrupt us, as the user?
Yes, an agent can be configured to interrupt a user. This can be useful in scenarios where the user is speaking for too long or if the agent needs to interject with important information. To achieve this, you can use a classifier to analyze the user's speech in real time and determine if an interruption is necessary based on predefined criteria.

For example, you can use the `VoicePipelineAgent` and set up a mechanism to monitor the user's speech. If the speech exceeds a certain length or matches specific patterns, the agent can interrupt by starting a new speech task. This involves using the `before_llm_cb` callback to modify the conversation context, or directly invoking the agent's `say` method with `allow_interruptions=True`.

Would you like more detailed guidance on implementing this feature?

Sources:
- https://docs.livekit.io/agents/voice-agent/voice-pipeline/
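For what the length-based variant could look like in practice, here is a rough sketch assuming the 0.x `VoicePipelineAgent` API: it uses the agent's `user_started_speaking` / `user_stopped_speaking` events and `say(..., allow_interruptions=True)`. The 15-second threshold and the interjection wording are arbitrary placeholders, and a classifier-based trigger would replace the fixed sleep with a check on the live transcription.

```python
import asyncio
from typing import Optional

from livekit.agents.pipeline import VoicePipelineAgent

MAX_USER_SPEECH_SECONDS = 15  # arbitrary threshold for this sketch


def setup_agent_interruptions(agent: VoicePipelineAgent) -> None:
    """Interject if the user has been speaking longer than the threshold."""
    interrupt_task: Optional[asyncio.Task] = None

    async def _interrupt_after_delay() -> None:
        # If the user is still talking once the threshold elapses, speak over them.
        await asyncio.sleep(MAX_USER_SPEECH_SECONDS)
        await agent.say(
            "Sorry to jump in, but let me make sure I understand so far.",
            allow_interruptions=True,  # the user can immediately take the floor back
        )

    @agent.on("user_started_speaking")
    def _on_user_started_speaking():
        nonlocal interrupt_task
        # Start the countdown each time the user begins a turn.
        interrupt_task = asyncio.create_task(_interrupt_after_delay())

    @agent.on("user_stopped_speaking")
    def _on_user_stopped_speaking():
        nonlocal interrupt_task
        # The user finished before the threshold, so cancel the pending interjection.
        if interrupt_task is not None:
            interrupt_task.cancel()
            interrupt_task = None
```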