busy-monitor-72069
09/14/2025, 2:51 AM

tall-airline-16756
09/14/2025, 2:51 AM

busy-monitor-72069
09/14/2025, 2:52 AM

tall-airline-16756
09/14/2025, 2:52 AM
• Content Filtering: You might consider overriding the llm_node method to catch and validate the LLM output as it happens. You could use a secondary LLM or your own custom logic (see the short sketch after this list) to make sure the content is appropriate and correct before it gets spoken.
• LLM-Based Judgment: Using an LLM to evaluate the agent's responses can help you check that they match the intended behavior. This works well both for real-time checks and for offline testing.
• Behavioral Testing: Writing tests to validate the accuracy of agent responses, tool usage, and error handling is a wonderful way to make sure your agent behaves as expected across different situations.
Feel free to combine these techniques to build a strong validation layer for your voice agent. And if you'd like, I'm here to help with more details or examples on how to implement any of these methods—just let me know!
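As a quick illustration of the "custom logic" option, here is a minimal sketch of an llm_node override that stops a reply once a banned keyword shows up. The class name and keyword list are made up for the example, and the exact imports and node signature can vary between livekit-agents versions, so treat it as a starting point rather than a drop-in implementation:

from collections.abc import AsyncIterable

from livekit.agents import Agent, ModelSettings, llm


class KeywordFilteredAgent(Agent):
    # Illustrative list of phrases the agent should never speak.
    BANNED_PHRASES = ("strawberry", "strawberries")

    async def llm_node(
        self,
        chat_ctx: llm.ChatContext,
        tools: list[llm.FunctionTool],
        model_settings: ModelSettings,
    ) -> AsyncIterable[llm.ChatChunk]:
        spoken_so_far = ""
        async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
            delta = getattr(chunk, "delta", None)
            if delta is not None and delta.content:
                spoken_so_far += delta.content
                # Stop forwarding output the moment a banned phrase appears.
                if any(p in spoken_so_far.lower() for p in self.BANNED_PHRASES):
                    return
            yield chunk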
Sources: pipeline-llm/llm_powered_content_filter.py | Testing and evaluation | LiveKit Docs | Docs::Agents | livekit-agents/livekit/agents/voice/generation.py

busy-monitor-72069
09/14/2025, 2:53 AM

tall-airline-16756
09/14/2025, 2:54 AM
arguments = tool.arguments
if not arguments:
    raise ValueError("LLM did not return any arguments for evaluation.")
assert isinstance(arguments, str)
# Call a validation function (fnc_args/fnc_kwargs are assumed to be parsed
# from the JSON-encoded `arguments` string earlier in the handler)
success, reason = await check_intent(*fnc_args, **fnc_kwargs)
if not success:
    raise ValueError(f"Validation failed: {reason}")
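For context on where a check like this could sit: the snippet above assumes `arguments` holds the raw JSON-encoded arguments of a pending tool call and that `fnc_args`/`fnc_kwargs` were parsed from it earlier. A simpler alternative is to validate inside the tool itself. Here is a hedged sketch with made-up names (`lookup_weather`, `check_intent`); they are not from the LiveKit docs:

from livekit.agents import RunContext, function_tool


async def check_intent(location: str = "") -> tuple[bool, str]:
    # Hypothetical validator: accept only non-empty location strings.
    if not location.strip():
        return False, "location must not be empty"
    return True, "ok"


@function_tool()
async def lookup_weather(context: RunContext, location: str) -> str:
    """Look up the current weather for a location."""
    # Validate the LLM-supplied arguments before doing any real work.
    success, reason = await check_intent(location=location)
    if not success:
        raise ValueError(f"Validation failed: {reason}")
    return f"The weather in {location} is sunny."  # placeholder result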
• Content Filtering: You might consider overriding the llm_node
method to intercept and validate the LLM output as it happens. For example, using a secondary LLM to evaluate the content could look like this:
async def evaluate_content(self, text: str) -> bool:
    # Ask a secondary "moderator" LLM to classify the generated text.
    moderation_ctx = ChatContext([
        ChatMessage(
            type="message",
            role="system",
            content=["Respond ONLY with 'APPROPRIATE' or 'INAPPROPRIATE'. Respond with 'INAPPROPRIATE' if the text mentions strawberries."]
        ),
        ChatMessage(type="message", role="user", content=[f"Evaluate: {text}"])
    ])
    response = ""
    async with self.moderator_llm.chat(chat_ctx=moderation_ctx) as stream:
        async for chunk in stream:
            # Accumulate the streamed completion text from the moderator LLM.
            content = getattr(chunk.delta, 'content', None) if hasattr(chunk, 'delta') else str(chunk)
            if content:
                response += content
    return "INAPPROPRIATE" not in response.strip().upper()
• LLM-Based Judgment: Another approach is to use the .judge()
method to assess the quality of the LLM's response. Here's an example:
await result.expect.next_event().is_message(role="assistant").judge(
    llm, intent="Offers a friendly introduction and offer of assistance."
)
• Behavioral Testing: Writing tests is a great way to ensure your agent responds correctly, uses tools properly, and handles errors gracefully. For example:
# Test for correct function call
result = await sess.run(user_input="What's the weather in Tokyo?")
result.expect.next_event().is_function_call(name="lookup_weather", arguments={"location": "Tokyo"})

# Test for error handling
with mock_tools(DriveThruAgent, {"order_regular_item": lambda: RuntimeError("test failure")}):
    result = await sess.run(user_input="Can I get a large vanilla shake?")
    await result.expect.next_event().is_message(role="assistant").judge(
        llm, intent="should inform the user that an error occurred"
    )
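For reference, the sess, result, and llm objects in these snippets come from a test-mode AgentSession. A minimal pytest scaffold might look roughly like this (the agent class and model choice are illustrative):

import pytest

from livekit.agents import Agent, AgentSession
from livekit.plugins import openai


class GreeterAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Greet the user warmly and offer to help.")


@pytest.mark.asyncio
async def test_greeting() -> None:
    async with (
        openai.LLM(model="gpt-4o-mini") as llm,  # also used as the judge below
        AgentSession(llm=llm) as sess,
    ):
        await sess.start(GreeterAgent())
        result = await sess.run(user_input="Hello!")
        # judge() asks the given LLM whether the reply matches the stated intent.
        await result.expect.next_event().is_message(role="assistant").judge(
            llm, intent="Offers a friendly greeting and an offer of assistance."
        )
        result.expect.no_more_events()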
These tests are really helpful to make sure your agent behaves safely and correctly in different situations. If you have any questions or want to dive deeper into any of these examples, please don’t hesitate to reach out—I’m here to help!