# ask-ai
c
Hey, is it possible to interrupt a voice agent based not only on the Silero VAD but also on whether the STT provider sends back a word, in order to prevent false positives? The VAD can sometimes trigger an interruption even when there is no real speech. How can we do that when using Deepgram, for example? And should we rely entirely on the STT provider's endpointing, or can we still use the EOU model on top of this system?
t
Thanks for reaching out, Clovis. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
To address the issue of false positives when interrupting a voice agent, you can combine the STT provider's feedback (such as Deepgram's) with an end-of-utterance (EOU) model. Here's how you can approach this:

• STT-based turn detection: Configure the system to use the STT provider's end-of-speech signals instead of relying solely on VAD. This can be done by setting the turn detection mode to "stt". This mode uses the STT provider's ability to detect natural speech boundaries, making it more accurate for complex speech patterns with natural pauses.

• Using an EOU model: The EOU model works alongside the STT provider to dynamically adjust the VAD silence timeout. It predicts whether the trailing end of the current context represents the end of a turn, reducing unintentional interruptions by 85% compared to VAD alone.

• Implementation example: Here's a sample configuration combining Deepgram's STT capabilities with the EOU model:

This approach minimizes false positives and enhances the overall responsiveness of the voice agent.

Sources: Speech & Audio Processing | livekit/livekit_composite | DeepWiki | livekit-agents/livekit/agents/voice/agent_session.py | Improving voice AI's turn detection with transformers | agent-examples/agent_examples/pipeline_agent.py
c
I think you forgot to put the example, can you share it please?
t
Apologies for the omission. Here is the sample configuration combining Deepgram's STT with the EOU model:
```python
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, turn_detector

agent = VoicePipelineAgent(
    vad=ctx.proc.userdata["vad"],            # Silero VAD instance loaded during prewarm
    stt=deepgram.STT(),                      # Deepgram speech-to-text
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=deepgram.TTS(),
    turn_detector=turn_detector.EOUModel(),  # transformer-based end-of-utterance model
    min_endpointing_delay=0.5,               # wait at least 0.5 s of silence before ending a turn
    max_endpointing_delay=5.0,               # never wait longer than 5 s
    chat_ctx=initial_ctx,
)
```
By combining Deepgram's speech-to-text with the EOU turn-detector model, turn detection becomes substantially more accurate: the agent waits for a likely end of utterance instead of interrupting on every VAD-detected pause, which prevents awkward interruptions and keeps the conversation flowing naturally.
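To address your original question directly: independent of any framework, the guard you describe can be stated as "only commit a VAD-triggered interruption once the STT stream has confirmed at least N words." Here is a minimal, framework-agnostic sketch of that gating logic. Note that `InterruptionGate` and its method names are hypothetical helpers for illustration, not LiveKit APIs:

```python
from dataclasses import dataclass, field

@dataclass
class InterruptionGate:
    """Commit a VAD-triggered interruption only after the STT stream
    confirms real speech with at least `min_words` transcribed words.
    (Hypothetical helper for illustration, not a LiveKit API.)"""
    min_words: int = 2                      # words required to confirm real speech
    _vad_triggered: bool = False
    _words: list = field(default_factory=list)

    def on_vad_speech_start(self) -> None:
        # A VAD event alone is only a *candidate* interruption
        self._vad_triggered = True
        self._words.clear()

    def on_stt_interim(self, transcript: str) -> bool:
        # Each interim STT result may confirm the candidate interruption
        self._words = transcript.split()
        return self.should_interrupt()

    def should_interrupt(self) -> bool:
        return self._vad_triggered and len(self._words) >= self.min_words

    def reset(self) -> None:
        # Call when the agent's turn ends or the candidate is discarded
        self._vad_triggered = False
        self._words.clear()
```

With this gating, a cough or background noise that fires the VAD but produces no transcript never interrupts the agent. `VoicePipelineAgent` also accepts an `interrupt_min_words` parameter that implements this idea natively; check the API of the version you're running.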