# ask-ai
**Navin:**
We've built a voice agent over SIP using the Deepgram plugin for ASR, Cartesia for TTS, and a custom LLM microservice. The LLM is integrated into the agent via a custom plugin that communicates over gRPC. We're encountering an issue where the ASR sometimes skips parts of the user's speech—especially during interruptions—resulting in incomplete or incorrect transcriptions. This affects the bot's ability to respond accurately. For example, in the attached screenshot, the user said "Yeah, it's fine", but the ASR didn't provide a final transcription for that part. A few seconds later, when the user said "Hello", the ASR finalized only "Hello", completely skipping the earlier phrase. Since the initial sentence wasn't finalized, it never reached the LLM for processing. Has anyone experienced similar behavior with Deepgram ASR in real-time SIP-based systems? Any advice on improving transcription reliability, particularly in interruption or barge-in scenarios? Thanks in advance!
**AI Assistant:**
Thanks for reaching out, Navin. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Could you provide more details about your current Deepgram ASR configuration? Specifically, are you using any custom settings or parameters for handling interruptions or barge-in scenarios? Additionally, are there any specific latency or network conditions that might be affecting the transcription process?
**Navin:**
```python
import os

stt_kwargs = dict(
    model=model_id,
    language=asr_settings.get("language", "en-US"),
    endpointing_ms=int(asr_settings.get("silence_threshold", 300)),
    smart_format=asr_settings.get("smart_format", True),
    interim_results=asr_settings.get("interim_results", True),
    api_key=os.getenv("DEEPGRAM_API_KEY"),
)
```
These are the ASR settings we're using.
**AI Assistant:**
Hey there! Let me break down how Deepgram's ASR settings can help you manage those tricky conversation moments:

- **endpointing_ms**: Think of this as a pause detector for speech. It decides how long a silence must last before someone is considered done talking. Set it too low and the system may cut people off too quickly; too high and it won't catch interruptions. The default is 25ms, and you can disable it entirely if you want.
- **smart_format**: This automatically cleans up numbers and dates in the transcript. When paired with `no_delay`, it returns results quickly, which matters for real-time conversations where every second counts.
- **interim_results**: This is your best friend for handling interruptions. It gives you ongoing partial transcripts while people are speaking, so you stay in the loop and can catch mid-conversation shifts.

Pro tip: Experimenting with `endpointing_ms` can really help you fine-tune how smoothly your system handles unexpected conversation twists. Punctuation is on by default and can also give your turn detection a nice boost.

Sources: livekit.plugins.deepgram API documentation | GitHub::livekit::agents | livekit-plugins/livekit-plugins-deepgram/livekit/...