tall-mechanic-24834
02/28/2025, 3:15 AM

gray-monkey-83525
02/28/2025, 3:15 AM

tall-mechanic-24834
02/28/2025, 3:16 AM
entrypoint (Crucial Part)
agent = InterruptibleSalesAgent(  # Show InterruptibleSalesAgent class
    persona=persona,
    room_name=ctx.room.name,
    vad=ctx.proc.userdata["vad"],
    stt=deepgram.STT(model="nova-3-general", api_key=deepgram_api_key),
    llm=google.LLM(
        model=google_model,  # Show LLM model being used
        api_key=google_api_key,
        temperature=0.7,
        max_output_tokens=100,
    ),
    tts=cartesia.TTS(
        model=tts_model,
        voice=tts_voice,
        emotion=tts_emotion,
        api_key=cartesia_api_key,
    ),
    turn_detector=turn_detector.EOUModel(),
    min_endpointing_delay=min_endpointing_delay,
    max_endpointing_delay=max_endpointing_delay,
    chat_ctx=initial_ctx,
    before_llm_cb=before_llm_cb,  # VERY IMPORTANT: before_llm_cb is passed here
)

# Setting agent properties (including enable_vision)
agent.recording_url = recording_url
agent.egress_id = egress_id
agent.webhook_url = webhook_url
agent.webhook_secret = webhook_secret
agent.enable_vision = enable_vision  # Show enable_vision assignment
# ... (rest of agent property settings if any) ...

# Vision Enabling Log
if enable_vision:
    logger.info("Vision enabled: set up before_llm_callback")  # Show the log line
I'm working on a Python agent using the Agents Framework and trying to enable vision, but it doesn't seem to be working. I've set enable_vision to True in my job metadata, and I've implemented a before_llm_cb function as shown in the documentation to capture video frames and add them to the LLM context.
However, my agent isn't describing anything visual in its responses. I'm using the Google Gemini API (model: gemini-2.0-flash, but I can change this if needed) for the LLM. I'm publishing video from my client to the LiveKit room.
Could you please help me diagnose why vision isn't functioning?
Here's the relevant code from my agent.py file for your review (I've included the imports, helper functions for video, the before_llm_cb function, and how I initialize my InterruptibleSalesAgent, including passing before_llm_cb and setting agent.enable_vision):
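(Not from the poster's file, but for context on the job-metadata part of the question: a minimal sketch of how an enable_vision flag is commonly read from job metadata in the entrypoint, assuming the metadata is a JSON string such as {"enable_vision": true}. The field name and logger name are illustrative assumptions.)

import json
import logging

from livekit.agents import JobContext

logger = logging.getLogger("agent")


async def entrypoint(ctx: JobContext):
    # Job metadata arrives as a plain string; here it is assumed to be JSON,
    # e.g. '{"enable_vision": true}', set by whatever dispatches the job.
    meta = json.loads(ctx.job.metadata or "{}")
    enable_vision = bool(meta.get("enable_vision", False))
    logger.info("enable_vision from job metadata: %s", enable_vision)
    # ... build the agent and set agent.enable_vision, as in the excerpt above ...

If this parsing step is skipped or the field name doesn't match what the client sends, enable_vision silently stays False and the vision callback never attaches a frame.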
gray-monkey-83525
02/28/2025, 3:16 AM
2. The get_latest_image function should be capturing frames correctly. Make sure that the rtc.VideoStream is properly initialized and that frames are being iterated over. If no frames are captured, it might indicate an issue with the video stream setup.
3. LLM Context Update: Ensure that the before_llm_cb function is correctly appending the captured image to chat_ctx.messages. The ChatImage should be properly instantiated with the captured frame (a minimal sketch illustrating points 2-5 follows after this list).
4. Enable Vision: Double-check that agent.enable_vision is set to True and that this setting is being respected in the before_llm_cb function.
5. Logging: Use logging to trace the flow of execution and verify that each step is being executed as expected. This includes checking if the before_llm_cb is being called and if the image is being added to the context.
If you would like more detailed guidance, please let me know, and I can break down the steps further.
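To make points 2-5 concrete, here is a minimal sketch of a frame-capturing before_llm_cb, assuming the pre-1.0 VoicePipelineAgent-style API (llm.ChatImage, chat_ctx.messages) and a recent livekit rtc SDK. The factory function and all names below are illustrative, not the poster's actual implementation.

import logging

from livekit import rtc
from livekit.agents import llm

logger = logging.getLogger("vision-example")


async def get_latest_image(room: rtc.Room):
    """Grab one frame from the first remote video track found, or None."""
    for participant in room.remote_participants.values():
        for pub in participant.track_publications.values():
            if pub.track and pub.track.kind == rtc.TrackKind.KIND_VIDEO:
                stream = rtc.VideoStream(pub.track)
                try:
                    # waits for the first frame the stream yields
                    async for event in stream:
                        return event.frame  # a single rtc.VideoFrame
                finally:
                    await stream.aclose()
    return None


def make_before_llm_cb(room: rtc.Room, enable_vision: bool):
    """Build a before_llm_cb bound to the room (captured by closure in the entrypoint)."""

    async def before_llm_cb(agent, chat_ctx: llm.ChatContext):
        if not enable_vision:  # point 4: respect the enable_vision flag
            return
        frame = await get_latest_image(room)
        if frame is None:  # point 2: verify frames are actually being captured
            logger.warning("before_llm_cb: no video frame captured")
            return
        logger.info("before_llm_cb: adding latest frame to the LLM context")  # point 5
        # point 3: append the image so the next LLM call can see it
        chat_ctx.messages.append(
            llm.ChatMessage(role="user", content=[llm.ChatImage(image=frame)])
        )

    return before_llm_cb

If the warning in this sketch fires on every turn, the failure is upstream of the LLM: the client's video track is not being subscribed, or the loop never finds a video publication in the room.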
Sources:
- https://livekit-users.slack.com/archives/C07FY8WHGPM/p1733239980.044189
- https://livekit-users.slack.com/archives/C07FY8WHGPM/p1730233417.503479

tall-mechanic-24834
02/28/2025, 4:40 AM

gray-monkey-83525
02/28/2025, 4:40 AM
You can use a VoicePipelineAgent and set up a mechanism to monitor the user's speech. If the speech exceeds a certain length or matches specific patterns, the agent can interrupt by starting a new speech task. This involves using the before_llm_cb callback to modify the conversation context or directly invoking the agent's say method with allow_interruptions=True.
Would you like more detailed guidance on implementing this feature?
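As a rough sketch of that idea, again assuming the pre-1.0 VoicePipelineAgent API: the user_speech_committed event and say(..., allow_interruptions=True) exist, but the threshold, wording, and overall heuristic below are illustrative assumptions.

import asyncio

from livekit.agents import llm
from livekit.agents.pipeline import VoicePipelineAgent

MAX_USER_CHARS = 400  # illustrative threshold for "speech exceeds a certain length"


def setup_interjections(agent: VoicePipelineAgent):
    """Have the agent jump in when a committed user turn gets very long."""

    def _on_user_speech(msg: llm.ChatMessage):
        text = msg.content if isinstance(msg.content, str) else ""
        if len(text) > MAX_USER_CHARS:
            # say() schedules a new speech task; allow_interruptions=True lets
            # the user cut the agent off again if they keep talking.
            asyncio.create_task(
                agent.say(
                    "Sorry to jump in - let me check I have the key point so far.",
                    allow_interruptions=True,
                )
            )

    agent.on("user_speech_committed", _on_user_speech)

This only covers the agent proactively jumping in after a long committed user turn; interruptions of the agent's own speech are handled separately by the pipeline's built-in interruption settings.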
Sources:
- https://docs.livekit.io/agents/voice-agent/voice-pipeline/