Why is a single user message sometimes broken into multiple transcriptions?
# ask-ai
m
Why is a single user message sometimes broken into multiple transcriptions? I am using Deepgram STT.
t
Thanks for reaching out, Gaurav. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
I’d love to help explain this a bit! When using Deepgram STT, a single user message can be split into multiple transcriptions because of how real-time streaming transcription works, in particular silence-based endpointing. Deepgram processes audio in small chunks (about 50ms) and uses the `endpointing_ms` parameter (25ms by default) to decide when the speaker has finished. If a pause in the middle of a sentence lasts longer than the `endpointing_ms` threshold, Deepgram finalizes the current utterance, and whatever the user says next arrives as a new transcription, so one logical message can show up as several transcripts. If you’d like, I can help you understand this better or guide you through adjusting these settings. Just let me know!
Sources: livekit-plugins/livekit-plugins-baseten/livekit/plugins/baseten/stt.py | Deepgram integration guide | LiveKit Docs | Deepgram STT integration guide | LiveKit Docs | VoiceAgent/Chat/Receive/TranscriptionStreamReceiver.swift
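For reference, here is a minimal sketch of where this setting lives when constructing the STT in the Python agents SDK. The model name and value are illustrative, and the exact keyword (`endpointing_ms`) may differ between plugin versions, so please check the signature of `deepgram.STT` in your installed release:

```python
# Hedged sketch: configuring Deepgram endpointing in the LiveKit Python plugin.
# The keyword name and default may vary by plugin version; values are examples.
from livekit.plugins import deepgram

stt = deepgram.STT(
    model="nova-2",      # example model name
    language="en",
    # Silence (in ms) Deepgram waits for before finalizing the current
    # utterance. Short values mean brief pauses end the utterance and the
    # next words arrive as a separate transcription.
    endpointing_ms=25,
)
```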
m
But we have already increased the `endpointing_ms` to 500.
t
I understand that even with `endpointing_ms` set to 500ms, messages can still be split into multiple transcriptions, because LiveKit applies its own turn detection on top of the STT. That turn detection relies on voice activity detection (VAD) together with a `min_endpointing_delay` parameter (500ms by default), which determines how long the silence needs to be before a user's turn is considered finished. If you'd like to reduce the segmentation, try increasing `min_endpointing_delay` to around 1000-1500ms; the trade-off is that the agent waits a bit longer before responding. Please feel free to reach out if you'd like any help or guidance with modifying these parameters. I’m here to support you!
Sources: Improving voice AI's turn detection with transformers | VoicePipelineAgent | LiveKit Docs | VoicePipelineAgent | LiveKit Docs | agents/src/voice/audio_recognition.ts | Deepgram STT integration guide | LiveKit Docs
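If it helps, here is a minimal sketch of raising that threshold in the Python agents SDK. Depending on your agents version the parameter is passed to `VoicePipelineAgent` or `AgentSession`; the code below assumes the `AgentSession` API and uses illustrative values:

```python
# Hedged sketch: raising LiveKit's turn-detection threshold so brief pauses
# are not treated as the end of the user's turn. Parameter placement may
# differ between agents versions (VoicePipelineAgent vs. AgentSession).
from livekit.agents import AgentSession
from livekit.plugins import deepgram, silero

session = AgentSession(
    stt=deepgram.STT(endpointing_ms=500),  # Deepgram-side endpointing
    vad=silero.VAD.load(),
    # Minimum trailing silence (in seconds) before LiveKit considers the
    # user's turn finished; 0.5 s is the documented default, so 1.0-1.5 s
    # reduces segmentation at the cost of some added latency.
    min_endpointing_delay=1.2,
)
```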
m
I think that will increase the latency. If the user message was split by the LiveKit VAD, will the incomplete message start getting processed right away?
t
Thanks for reaching out! LiveKit does not start processing an incomplete message as soon as the VAD splits it. Instead, it buffers the audio until it detects a complete speech segment, guided by settings such as `min_silence_duration` (on the VAD) and `min_endpointing_delay` (on the turn detector). Only once those thresholds are satisfied is the finalized utterance handed on for processing, which keeps the transcriptions accurate and reliable. If you'd like, I can walk you through tuning these settings to balance segmentation against latency. Just let me know!
Sources: Speech & Audio Processing | livekit/livekit_composite | DeepWiki | AI voice agents | LiveKit Docs | livekit-plugins/livekit-plugins-speechmatics/livekit/plugins/speechmatics/s... | Speech-to-text (STT) integrations | LiveKit Docs | Working with plugins | LiveKit Docs
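As a concrete example, here is a minimal sketch of the VAD-side buffering knobs mentioned above, assuming the Silero VAD plugin for the Python agents SDK; parameter names and defaults may differ in your version, so treat the values as illustrative:

```python
# Hedged sketch: tuning Silero VAD so audio is buffered until a full speech
# segment (ending in sufficient silence) is detected, rather than handing
# partial audio to the STT mid-utterance.
from livekit.plugins import silero

vad = silero.VAD.load(
    # Speech shorter than this (seconds) is treated as noise and ignored.
    min_speech_duration=0.05,
    # Trailing silence (seconds) required before the buffered segment is
    # closed and forwarded; raising it merges short pauses into one segment.
    min_silence_duration=0.8,
)
```

This works together with `min_endpointing_delay`: the VAD decides when a speech segment is complete, and the endpointing delay decides how long the agent waits after that before treating the user's turn as over.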