# ask-ai
h
is there a script that can directly use livekit vad?
t
Thanks for reaching out, YUNYI. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Absolutely! There are scripts and examples ready to help you use LiveKit's Voice Activity Detection (VAD) right away. If you're working with Python, the Silero VAD plugin is a great choice. Here's a simple example to get you started:
```python
from livekit.agents.vad import VADEventType
from livekit.plugins import silero

# Load the VAD model
vad = silero.VAD.load(
    min_speech_duration=0.05,
    min_silence_duration=0.4,
    activation_threshold=0.5,
    sample_rate=16000
)

# Create a stream for processing audio
stream = vad.stream()

# Process audio frames
for frame in audio_frames:
    stream.push_frame(frame)

stream.end_input()

# Handle VAD events
async for event in stream:
    if event.type == VADEventType.START_OF_SPEECH:
        print("Speech started")
    elif event.type == VADEventType.END_OF_SPEECH:
        print("Speech ended")
    elif event.type == VADEventType.INFERENCE_DONE:
        print(f"Speech probability: {event.probability}")
```
If you'd like to see a full example in action, check out this script: LiveKit Python Agents Examples. And hey, if you have any questions or a specific scenario you're working on, just let me know — I'm here to help! Sources: Silero VAD plugin | LiveKit Docs | livekit-plugins/livekit-plugins-silero/livekit/plugins/silero/__init__.py | plugins/silero/src/index.ts
h
Can you explain: from livekit.agents.vad import VADEventType from livekit.plugins import silero # Load the VAD model vad = silero.VAD.load( min_speech_duration=0.05, min_silence_duration=0.4, activation_threshold=0.5, sample_rate=16000 ) # Create a stream for processing audio stream = vad.stream() # Process audio frames for frame in audio_frames: stream.push_frame(frame) stream.end_input() # Handle VAD events async for event in stream: if event.type == VADEventType.START_OF_SPEECH: print("Speech started") elif event.type == VADEventType.END_OF_SPEECH: print("Speech ended") elif event.type == VADEventType.INFERENCE_DONE: print(f"Speech probability: {event.probability}") What do audio_frames, START_OF_SPEECH, END_OF_SPEECH, and INFERENCE_DONE mean? Can you show me the source of this?
t
Here’s a quick rundown of the parts in your example:
• audio_frames: the audio data being analyzed. Think of it as a series of small audio chunks (frames) that the VAD inspects to figure out when someone is speaking.
• START_OF_SPEECH: emitted when the VAD spots the start of speech, i.e. once the speech probability rises above the activation threshold and the speech has lasted at least the minimum speech duration.
• END_OF_SPEECH: signals that a speech segment has ended. It fires once the silence after someone stops talking exceeds the minimum silence duration.
• INFERENCE_DONE: emitted after each chunk of audio is analyzed. It reports the probability that speech is present, how long the inference took, and whether speech was detected in those frames.
Each event also carries extra info such as the timestamp, the speech duration, and the silence duration. The VAD looks at audio in small windows and tracks the speech probability to pin down exactly when speech starts and ends; the sketch below shows one way to build audio_frames and read those extra fields. If you want to dive deeper into the source code or check out some examples, just let me know—I’m happy to point you in the right direction!
Sources: examples/primitives/echo-agent.py | agents/src/stt/stream_adapter.ts
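As promised, here is a minimal sketch of what audio_frames could be in practice and how to read the extra event fields. It assumes a local 16 kHz, mono, 16-bit WAV file (the path is just a placeholder) and that VAD events expose the timestamp, probability, and speech_duration fields described above; wav_to_frames and inspect_events are illustrative helper names, not part of the LiveKit API.

```python
import wave

from livekit import rtc
from livekit.agents.vad import VADEventType


def wav_to_frames(path: str, frame_ms: int = 10) -> list[rtc.AudioFrame]:
    """Split a 16 kHz, mono, 16-bit WAV file into short AudioFrame chunks."""
    frames = []
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()               # expected: 16000
        samples_per_frame = sample_rate * frame_ms // 1000
        while True:
            data = wav.readframes(samples_per_frame)
            if len(data) < samples_per_frame * 2:      # 2 bytes per int16 sample
                break
            frames.append(
                rtc.AudioFrame(
                    data=data,
                    sample_rate=sample_rate,
                    num_channels=1,
                    samples_per_channel=samples_per_frame,
                )
            )
    return frames


async def inspect_events(stream):
    """Print the extra fields carried by each VAD event."""
    async for event in stream:
        if event.type == VADEventType.INFERENCE_DONE:
            print(f"t={event.timestamp:.2f}s  p(speech)={event.probability:.2f}")
        elif event.type == VADEventType.START_OF_SPEECH:
            print(f"speech started at t={event.timestamp:.2f}s")
        elif event.type == VADEventType.END_OF_SPEECH:
            print(f"speech segment lasted {event.speech_duration:.2f}s")
```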
h
I want to dive deeper
t
Let’s take a closer look at LiveKit’s VAD (Voice Activity Detection) and break down the key details for you:
• VAD Event Types:
  ◦ START_OF_SPEECH: fires when speech kicks off, based on hitting the activation threshold and minimum speech duration.
  ◦ END_OF_SPEECH: triggers when speech wraps up, after enough silence has been detected to exceed the minimum silence duration.
  ◦ INFERENCE_DONE: emitted after processing an inference window, giving you info like speech probability, how long the inference took, and whether someone’s speaking.
• Audio Frames: these are the chunks of audio data that the VAD processes, linked to the events:
  ◦ For START_OF_SPEECH, the frames include the audio that triggered the detection.
  ◦ For END_OF_SPEECH, the frames cover the entire speech segment.
  ◦ For INFERENCE_DONE, the frames represent the audio window that was processed.
Here’s a quick Python example to illustrate:
```python
from livekit.agents.vad import VADEventType

# vad_stream is the stream returned by vad.stream()
async for vad_event in vad_stream:
    if vad_event.type == VADEventType.START_OF_SPEECH:
        print("Speech started")
    elif vad_event.type == VADEventType.END_OF_SPEECH:
        print("Speech ended")
        # the frames covering the whole buffered speech segment
        speech_frames = vad_event.frames
```
• Advanced Features:
  ◦ Real-time processing with thresholds you can tweak to fit your needs.
  ◦ Automatic resampling to handle different input sample rates smoothly.
  ◦ Speech buffering to make sure you capture the whole speech segment without missing a beat (see the sketch below for one way to use the buffered frames).
If you’re interested in diving into the source code or want help with specific configurations, just give me a shout! I’m happy to point you to the right resources or examples.
Sources: agents/src/stt/stream_adapter.ts | livekit-agents/livekit/agents/stt/stream_adapter.py
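Here is that sketch of using the buffered frames: when END_OF_SPEECH fires, event.frames should hold the whole buffered segment, so you can stitch those frames back into a file. This assumes the frames are 16-bit PCM rtc.AudioFrame objects as described above; the function and file names are illustrative, not part of the LiveKit API.

```python
import wave

from livekit.agents.vad import VADEventType


async def save_speech_segments(vad_stream):
    """Write each detected speech segment to its own WAV file (illustrative)."""
    segment_index = 0
    async for event in vad_stream:
        if event.type == VADEventType.END_OF_SPEECH and event.frames:
            first = event.frames[0]
            path = f"segment_{segment_index}.wav"
            with wave.open(path, "wb") as out:
                out.setnchannels(first.num_channels)
                out.setsampwidth(2)                    # 16-bit PCM samples
                out.setframerate(first.sample_rate)
                for frame in event.frames:
                    out.writeframes(bytes(frame.data))
            print(f"wrote {path}")
            segment_index += 1
```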
h
I mean the link
t
Feel free to check out the source code and examples for LiveKit VAD right here: LiveKit Python Agents Examples. If you have any questions or need a hand with the implementation, just give me a shout—I'm here to help!