square-jewelry-14040
08/20/2025, 3:44 PM
google.STT(
    model=STTModels.GOOGLE_LATEST_LONG.value,
    languages=["en-US"],
    credentials_info=json.loads(settings.gcp_ai_services_credentials),
    use_streaming=True,
    # punctuate=False,
)
tall-airline-16756
08/20/2025, 3:45 PM
1. Adjust Endpointing Delays:
• Lower min_endpointing_delay to 0.2s (the default is usually 0.5s).
• Bring down max_endpointing_delay to 3.0s (instead of the default 6.0s).
2. Optimize Turn Detection:
• Go with turn_detection='vad' for quicker detection compared to 'stt'.
3. Configure Google STT for Lower Latency:
• Switch over to the 'latest_short' model; it's faster!
• Turn on interim_results=True to get feedback sooner.
• Lower min_confidence_threshold to 0.5 (default is 0.65) to speed things up.
• Disable features you don't need, like enable_word_time_offsets and enable_word_confidence, to keep things lean.
4. Additional Optimizations:
• If you're using VAD, try reducing min_silence_duration to 0.3s to catch silences faster.
• Keep an eye on performance metrics to see how these tweaks are working (see the sketch after this list).
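Putting those together, here's a rough sketch of a low-latency setup (assuming the LiveKit Agents 1.x AgentSession API and the Silero VAD plugin; the values are the suggestions above, not defaults I've verified for your version):
from livekit.agents import AgentSession, MetricsCollectedEvent, metrics
from livekit.plugins import google, silero

session = AgentSession(
    stt=google.STT(
        model="latest_short",  # faster than "latest_long"
        languages=["en-US"],
        interim_results=True,  # get feedback sooner
        min_confidence_threshold=0.5,  # down from the 0.65 default
        use_streaming=True,
        # credentials resolved via Application Default Credentials here;
        # pass credentials_info=... as in your snippet if you need to
    ),
    vad=silero.VAD.load(min_silence_duration=0.3),  # catch silences faster
    turn_detection="vad",  # quicker than "stt"
    min_endpointing_delay=0.2,
    max_endpointing_delay=3.0,
)

# Watch the metrics events to confirm the tweaks actually help.
@session.on("metrics_collected")
def _on_metrics(ev: MetricsCollectedEvent):
    metrics.log_metrics(ev.metrics)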
These little adjustments should help speed things up without losing too much accuracy. If you want to chat more or need a hand with anything else, just let me know!
Sources: Capturing metrics | LiveKit Docs | How does end-of-utterance detection work in conversations? | Speech & Audio Processing | livekit/livekit_composite | DeepWiki | Agent Testing & Evaluation | livekit/livekit_composite | DeepWiki | livekit-agents/livekit/agents/voice/audio_recognition.py
square-jewelry-14040
08/20/2025, 3:58 PM
tall-airline-16756
08/20/2025, 3:59 PM
1. Google STT Configuration:
Example:
from livekit.plugins import google
stt = google.STT(
    enable_word_time_offsets=True,
    enable_word_confidence=True,
    languages="en-US",
    model="latest_long"
)
2. Endpointing Delay Configuration:
• `min_endpointing_delay`: The minimum wait time after detecting a turn end (default: 0.5s).
• `max_endpointing_delay`: The maximum wait time for user input after a turn boundary (default: 6.0s).
Example:
from livekit.agents import AgentSession
session = AgentSession(
    stt=stt,  # your configured Google STT
    min_endpointing_delay=0.5,  # feel free to tweak this
    max_endpointing_delay=5.0   # adjust as you see fit
)
If you have any questions or want me to walk you through anything, just let me know! I'm here to help.
Sources: livekit-agents/livekit/agents/voice/audio_recognition.py | Turn detection and interruptions | LiveKit Docs | How does end-of-utterance detection work in conversations?
square-jewelry-14040
08/20/2025, 4:02 PM
class STT(stt.STT):
    def __init__(
        self,
        *,
        languages: LanguageCode = "en-US",  # Google STT can accept multiple languages
        detect_language: bool = True,
        interim_results: bool = True,
        punctuate: bool = True,
        spoken_punctuation: bool = False,
        model: SpeechModels | str = "latest_long",
        location: str = "global",
        sample_rate: int = 16000,
        min_confidence_threshold: float = _default_min_confidence,
        credentials_info: NotGivenOr[dict] = NOT_GIVEN,
        credentials_file: NotGivenOr[str] = NOT_GIVEN,
        keywords: NotGivenOr[list[tuple[str, float]]] = NOT_GIVEN,
        use_streaming: NotGivenOr[bool] = NOT_GIVEN,
    ):
"""
Create a new instance of Google STT.
Credentials must be provided, either by using the ``credentials_info`` dict, or reading
from the file specified in ``credentials_file`` or via Application Default Credentials as
described in <https://cloud.google.com/docs/authentication/application-default-credentials>
args:
languages(LanguageCode): list of language codes to recognize (default: "en-US")
detect_language(bool): whether to detect the language of the audio (default: True)
interim_results(bool): whether to return interim results (default: True)
punctuate(bool): whether to punctuate the audio (default: True)
spoken_punctuation(bool): whether to use spoken punctuation (default: False)
model(SpeechModels): the model to use for recognition default: "latest_long"
location(str): the location to use for recognition default: "global"
sample_rate(int): the sample rate of the audio default: 16000
min_confidence_threshold(float): minimum confidence threshold for recognition
(default: 0.65)
credentials_info(dict): the credentials info to use for recognition (default: None)
credentials_file(str): the credentials file to use for recognition (default: None)
keywords(List[tuple[str, float]]): list of keywords to recognize (default: None)
use_streaming(bool): whether to use streaming for recognition (default: True)
"""
tall-airline-16756
08/20/2025, 4:03 PM