mammoth-shoe-93270
07/27/2025, 1:46 AM
tall-airline-16756
07/27/2025, 1:47 AM
• Declare Non-Streaming Capabilities: Define the provider's capabilities as STTCapabilities(streaming=False, interim_results=False), or similar settings depending on your requirements.
• StreamAdapter for Non-Streaming Providers: LiveKit provides a StreamAdapter class that works with Voice Activity Detection (VAD). It buffers audio until the VAD detects the end of speech, then sends the buffered segment to the provider; a full example appears later in this thread.
• The StreamAdapter ensures that audio is processed in chunks, even without a streaming API.
• Implementation Methods: The STT base class provides both non-streaming and streaming entry points, such as recognize (for complete audio buffers) and stream() (for pushing audio frames); see the sketch below.
While LiveKit prefers streaming providers for real-time performance, these mechanisms allow you to integrate non-streaming STT providers effectively.
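To make those two entry points concrete, here is a minimal sketch assuming the livekit-agents Python API with the OpenAI and Silero plugins; the one-second silent frame and 16 kHz sample rate are illustrative values, not from this thread, and an OPENAI_API_KEY is assumed to be set:

import asyncio
from livekit import agents, rtc
from livekit.plugins import openai, silero

async def main():
    whisper_stt = openai.STT()
    # Non-streaming providers report streaming=False in their capabilities
    print(whisper_stt.capabilities.streaming)

    # recognize(): pass a complete audio buffer (here, one second of silence at 16 kHz)
    frame = rtc.AudioFrame(data=b"\x00" * 32000, sample_rate=16000,
                           num_channels=1, samples_per_channel=16000)
    event = await whisper_stt.recognize(buffer=frame)
    print("recognize() result:", event.alternatives[0].text)

    # stream(): only usable once the provider is wrapped with a VAD-backed StreamAdapter
    adapter = agents.stt.StreamAdapter(stt=whisper_stt, vad=silero.VAD.load())
    stt_stream = adapter.stream()  # frames can now be pushed with push_frame()
    await stt_stream.aclose()

asyncio.run(main())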
Sources: Speech-to-text (STT) integrations | LiveKit Docs | agents/src/tts/tts.ts | STTCapabilities | LiveKit Agents | livekit-agents/livekit/agents/voice/agent.py
mammoth-shoe-93270
07/27/2025, 1:47 AM
tall-airline-16756
07/27/2025, 1:48 AM
import asyncio
import wave

from livekit import agents, rtc
from livekit.plugins import openai, silero

async def main():
    # Initialize the non-streaming STT provider
    whisper_stt = openai.STT()

    # Configure Voice Activity Detection (VAD)
    vad = silero.VAD.load(
        min_speech_duration=0.1,   # Minimum duration of speech to consider
        min_silence_duration=0.5,  # Minimum silence duration to mark end of speech
    )

    # Use StreamAdapter to buffer audio until VAD detects end of speech
    stt = agents.stt.StreamAdapter(stt=whisper_stt, vad=vad)
    stt_stream = stt.stream()

    # Example of pushing audio frames to the STT stream (assumes a 16-bit PCM WAV file)
    with wave.open("audio_file.wav", "rb") as audio_file:
        rate, channels = audio_file.getframerate(), audio_file.getnchannels()
        while chunk := audio_file.readframes(1024):  # Read audio in chunks
            stt_stream.push_frame(rtc.AudioFrame(
                data=chunk, sample_rate=rate, num_channels=channels,
                samples_per_channel=len(chunk) // (2 * channels),
            ))

    # Finalize the stream and get the transcription
    stt_stream.end_input()
    async for event in stt_stream:
        if event.type == agents.stt.SpeechEventType.FINAL_TRANSCRIPT:
            print("Transcription:", event.alternatives[0].text)

asyncio.run(main())
So, what's happening here? Let me break it down for you:
• We're setting up a non-streaming STT provider - think of it like preparing a translator for audio.
• We've got Voice Activity Detection (VAD) to help identify when someone's actually speaking.
• The StreamAdapter helps us process audio chunks smoothly and get those transcriptions.
Feel free to tweak this example to match your specific STT provider and VAD library. It's super adaptable! Let me know if you want to dive deeper into how this works.
mammoth-shoe-93270
07/27/2025, 1:49 AM
tall-airline-16756
07/27/2025, 1:50 AM
Here's the relevant check from the LiveKit agents source (livekit-agents/livekit/agents/voice/agent.py):
if not activity.stt.capabilities.streaming:
    if not activity.vad:
        raise RuntimeError(
            f"The STT ({activity.stt.label}) does not support streaming, add a VAD to the AgentTask/VoiceAgent to enable streaming"
            "Or manually wrap your STT in a stt.StreamAdapter"
        )
    wrapped_stt = stt.StreamAdapter(stt=wrapped_stt, vad=activity.vad)
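In practice, the first option just means handing both the non-streaming STT and a VAD to your agent and letting the framework do the wrapping. A minimal sketch, assuming the current Python API where the session class is AgentSession (older betas used the AgentTask/VoiceAgent naming from the error message above), with the OpenAI and Silero plugins standing in for your providers:

from livekit import agents
from livekit.plugins import openai, silero

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()
    session = agents.AgentSession(
        stt=openai.STT(),       # non-streaming provider
        vad=silero.VAD.load(),  # the VAD lets the framework wrap the STT in a StreamAdapter
        llm=openai.LLM(),
        tts=openai.TTS(),
    )
    await session.start(
        agent=agents.Agent(instructions="You are a helpful voice assistant."),
        room=ctx.room,
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))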
And here's another way you could set it up manually:
whisper_stt = openai.STT()
vad = silero.VAD.load(
    min_speech_duration=0.1,
    min_silence_duration=0.5,
)
# StreamAdapter buffers audio until the VAD emits an END_OF_SPEECH event
stt = agents.stt.StreamAdapter(stt=whisper_stt, vad=vad)
stt_stream = stt.stream()
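To feed that manually wrapped stream with real-time audio from a LiveKit room rather than a file, one sketch is to relay frames from an rtc.AudioStream; the track argument here stands in for a subscribed participant audio track and isn't shown being obtained:

from livekit import rtc

async def forward_track(track: rtc.Track, stt_stream) -> None:
    # Relay each audio frame from the remote participant into the STT stream
    audio_stream = rtc.AudioStream(track)
    async for frame_event in audio_stream:
        stt_stream.push_frame(frame_event.frame)
    stt_stream.end_input()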
Basically, this setup lets non-streaming speech-to-text providers join real-time conversations by smartly capturing and processing your speech segments. Pretty cool, right? 😊
Sources: livekit-agents/livekit/agents/tts/tts.py | Speech-to-text (STT) integrations | LiveKit Docs
mammoth-shoe-93270
07/27/2025, 1:50 AM
tall-airline-16756
07/27/2025, 1:51 AM
import asyncio
import wave

from livekit import agents, rtc
from livekit.plugins import openai, silero

async def main():
    # Step 1: Initialize the non-streaming STT provider
    whisper_stt = openai.STT()

    # Step 2: Configure Voice Activity Detection (VAD)
    vad = silero.VAD.load(
        min_speech_duration=0.1,   # Minimum duration of speech to consider
        min_silence_duration=0.5,  # Minimum silence duration to mark end of speech
    )

    # Step 3: Use StreamAdapter to buffer audio until VAD detects end of speech
    stt = agents.stt.StreamAdapter(stt=whisper_stt, vad=vad)
    stt_stream = stt.stream()

    # Step 4: Open the audio file and push it to the stream in chunks (16-bit PCM WAV)
    with wave.open("audio_file.wav", "rb") as audio_file:
        rate, channels = audio_file.getframerate(), audio_file.getnchannels()
        while chunk := audio_file.readframes(1024):  # Read 1024 frames at a time
            stt_stream.push_frame(rtc.AudioFrame(
                data=chunk, sample_rate=rate, num_channels=channels,
                samples_per_channel=len(chunk) // (2 * channels),
            ))

    # Step 5: Finalize the stream
    stt_stream.end_input()  # Signal the end of the audio stream

    # Step 6: Process and print the transcription results
    async for event in stt_stream:
        if event.type == agents.stt.SpeechEventType.FINAL_TRANSCRIPT:
            print("Transcription:", event.alternatives[0].text)

asyncio.run(main())
### Let's Break Down What's Happening:
1. Initialize the STT Provider: We're setting up our speech-to-text magic! Think of this like preparing a translator for audio.
2. Configure VAD: This is like teaching our system to recognize when someone is actually speaking versus background noise.
3. Set Up StreamAdapter: Imagine this as a smart buffer that waits until someone finishes a complete thought before sending it for transcription.
4. Process Audio in Chunks: We're reading the audio file piece by piece, just like how we listen to someone speaking.
5. Finalize the Stream: This is basically saying, "Okay, we're done recording now!"
6. Retrieve Transcriptions: Here's where we see the magic happen - turning audio into text!
Feel free to tweak this example to match your specific STT provider and audio source. Happy transcribing! 🚀
mammoth-shoe-93270
07/27/2025, 1:52 AM
tall-airline-16756
07/27/2025, 1:52 AM
import asyncio

import pyaudio
from livekit import agents, rtc
from livekit.plugins import openai, silero

# Hey there! Let me walk you through how we can capture live audio for transcription.
# It's actually pretty cool and not as complicated as it might seem.

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 1024

async def main():
    # Step 1: Initialize the non-streaming STT provider
    whisper_stt = openai.STT()

    # Step 2: Configure Voice Activity Detection (VAD)
    vad = silero.VAD.load(
        min_speech_duration=0.1,   # Minimum duration of speech to consider
        min_silence_duration=0.5,  # Minimum silence duration to mark end of speech
    )

    # Step 3: Use StreamAdapter to buffer audio until VAD detects end of speech
    stt = agents.stt.StreamAdapter(stt=whisper_stt, vad=vad)
    stt_stream = stt.stream()

    # Step 4: Set up live audio input using PyAudio (16 kHz, mono, 16-bit PCM)
    p = pyaudio.PyAudio()
    mic = p.open(format=pyaudio.paInt16,
                 channels=1,
                 rate=SAMPLE_RATE,
                 input=True,
                 frames_per_buffer=CHUNK_SAMPLES)

    print("Listening for audio...")
    try:
        while True:
            audio_chunk = mic.read(CHUNK_SAMPLES)  # Read one chunk from the microphone
            stt_stream.push_frame(rtc.AudioFrame(
                data=audio_chunk,
                sample_rate=SAMPLE_RATE,
                num_channels=1,
                samples_per_channel=CHUNK_SAMPLES,
            ))
            await asyncio.sleep(0)  # Yield to the event loop between reads
    except KeyboardInterrupt:
        print("Stopping audio capture...")
    finally:
        # Step 5: Finalize the stream
        stt_stream.end_input()  # Signal the end of the audio stream
        mic.stop_stream()
        mic.close()
        p.terminate()

    # Step 6: Process and print the transcription results
    async for event in stt_stream:
        if event.type == agents.stt.SpeechEventType.FINAL_TRANSCRIPT:
            print("Transcription:", event.alternatives[0].text)

asyncio.run(main())
### Key Changes for Live Audio:
1. Real-Time Audio Input: We're using PyAudio to grab audio from your microphone in real-time. It's like having a digital ear listening to what you're saying!
2. Continuous Processing: As you speak, we're capturing audio chunks and sending them through our speech-to-text system. Think of it like catching your words as they flow out.
3. Graceful Shutdown: The script will keep listening until you decide to stop it (usually with a Ctrl+C). It's designed to be super user-friendly.
This approach is perfect for those moments when you want to have a natural, conversational interaction with your device. No pre-recording needed – just speak, and it listens!
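One caveat on the continuous-processing point: as written, the script above only prints transcriptions after you stop capturing, because events are read in Step 6 once the loop ends. If you want results while the microphone is still open, a minimal sketch (reusing the stt_stream name from the example above) is to consume the stream in a separate asyncio task:

import asyncio
from livekit import agents

async def print_transcriptions(stt_stream):
    # Runs alongside the capture loop and prints each final transcript as it arrives
    async for event in stt_stream:
        if event.type == agents.stt.SpeechEventType.FINAL_TRANSCRIPT:
            print("Transcription:", event.alternatives[0].text)

# Inside main(), start the consumer before the capture loop:
#     consumer = asyncio.create_task(print_transcriptions(stt_stream))
# ...and after stt_stream.end_input() in the finally block:
#     await consumer
# (Step 6 in the script above then becomes unnecessary.)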