LiveKit uses Voice Activity Detection (VAD) to chunk user speech before sending audio to Speech-to-Text (STT) systems. This behavior is configurable through parameters that depend on the specific VAD implementation in use.
Key Details About Speech Chunking:
• Voice Activity Detection (VAD): LiveKit employs VAD to detect the start and end of speech chunks. For example, the Silero VAD plugin provides configuration options such as the following (a configuration sketch appears after this list):
◦ `min_speech_duration`: Minimum duration of speech required to start a new speech chunk (default: 0.05 seconds).
◦ `min_silence_duration`: Duration of silence to wait after speech ends before considering the chunk complete (default: 0.55 seconds).
◦ `prefix_padding_duration`: Duration of padding added to the beginning of each speech chunk (default: 0.5 seconds).
◦ `max_buffered_speech`: Maximum duration of speech that can be buffered (default: 60.0 seconds).
• Turn Detection and Endpointing: LiveKit combines VAD with additional turn-detection strategies, including endpointing delays, to determine when a user has finished speaking (see the endpointing sketch after this list).
• Buffering and Streaming: For STT systems that don't support streaming input, LiveKit uses VAD together with the `StreamAdapter` class to buffer audio until VAD detects the end of speech (see the `StreamAdapter` sketch after this list).
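First, a minimal sketch of loading the Silero VAD plugin with the chunking parameters listed above. The values shown are the documented defaults, so the explicit arguments are purely illustrative:

```python
from livekit.plugins import silero

# Load Silero VAD with explicit chunking parameters. The values shown are
# the documented defaults, so this call behaves the same as silero.VAD.load().
vad = silero.VAD.load(
    min_speech_duration=0.05,     # >= 50 ms of speech opens a new chunk
    min_silence_duration=0.55,    # 550 ms of silence closes the chunk
    prefix_padding_duration=0.5,  # 500 ms of audio prepended to each chunk
    max_buffered_speech=60.0,     # at most 60 s of speech is buffered
)
```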
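Endpointing delays are typically configured on the session rather than on the VAD itself. A sketch, assuming the `AgentSession` class from `livekit-agents` 1.x and its `min_endpointing_delay`/`max_endpointing_delay` parameters (these names are an assumption on my part, not taken from the sources below):

```python
from livekit.agents import AgentSession
from livekit.plugins import silero

# The session combines VAD end-of-speech events with endpointing delays to
# decide when the user's turn is over. min/max_endpointing_delay are assumed
# from the livekit-agents 1.x API; they are not listed in the sources below.
session = AgentSession(
    vad=silero.VAD.load(),
    min_endpointing_delay=0.5,  # wait at least 0.5 s after VAD reports silence
    max_endpointing_delay=6.0,  # never wait longer than 6 s to end the turn
)
```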
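Finally, a sketch of wrapping a non-streaming STT with `StreamAdapter`. The choice of `openai.STT` (Whisper) as the non-streaming provider is an assumed example:

```python
from livekit.agents import stt
from livekit.plugins import openai, silero

# openai.STT (Whisper) does not accept streaming audio, so StreamAdapter
# buffers frames and flushes them to the wrapped STT once the VAD detects
# the end of speech.
buffered_stt = stt.StreamAdapter(
    stt=openai.STT(),
    vad=silero.VAD.load(),
)
```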
These parameters allow customization of how speech chunks are formed and sent to STT. Let me know if you’d like more details on configuring these options or applying them to a specific use case.
Sources:
• Silero VAD plugin | LiveKit Docs
• Voice Assistants | livekit/livekit_composite | DeepWiki
• Speech-to-text (STT) integrations | LiveKit Docs
• ElevenLabs integration guide | LiveKit Docs