I am using google STT and in the metrics I get very high del LiveKit Community #ask-ai

I am using google STT and in the metrics I get ver...

square-jewelry-14040

08/20/2025, 3:44 PM

I am using google STT and in the metrics I get very high delays in End OF Utterance: EOU Delay: 1.2413s, Transcription Delay: 1.0779s Can you please help me by offering suggestions I can try to make it better? The STT model is being isntantiated as follows:

Copy code

google.STT(
                model=STTModels.GOOGLE_LATEST_LONG.value,
                languages=["en-US"],
                credentials_info=json.loads(settings.gcp_ai_services_credentials),
                use_streaming=True,
                # punctuate=False,
            )

tall-airline-16756

08/20/2025, 3:44 PM

Thanks for reaching out, Rida. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.

tall-airline-16756

08/20/2025, 3:45 PM

To help cut down on the End of Utterance (EOU) and Transcription delays when using Google STT, here are some friendly tips you can try: 1. Adjust Endpointing Parameters: • Try lowering

min_endpointing_delay

to 0.2s (it’s usually 0.5s by default). • Bring down

max_endpointing_delay

to 3.0s (instead of the default 6.0s). 2. Optimize Turn Detection: • Go with

turn_detection='vad'

for quicker detection compared to

'stt'

. 3. Configure Google STT for Lower Latency: • Switch over to the

'latest_short'

model—it’s faster! • Turn on

interim_results=True

to get feedback sooner. • Lower the

min_confidence_threshold

to 0.5 (default is 0.65) to speed things up. • Disable features you don’t need like

enable_word_time_offsets

and

enable_word_confidence

to keep things lean. 4. Additional Optimizations: • If you’re using VAD, try reducing

min_silence_duration

to 0.3s to catch silences faster. • Keep an eye on performance metrics to see how these tweaks are working. These little adjustments should help speed things up without losing too much accuracy. If you want to chat more or need a hand with anything else, just let me know! Sources: Capturing metrics | LiveKit Docs | How does end-of-utterance detection work in conversations? | Speech & Audio Processing | livekit/livekit_composite | DeepWiki | Agent Testing & Evaluation | livekit/livekit_composite | DeepWiki | livekit-agents/livekit/agents/voice/audio_recognition.py

square-jewelry-14040

08/20/2025, 3:58 PM

I cannot find enable_word_time_offsets and enable_word_confidence in google.STT. Also I couldnt find where i can set min_endpointing_delay and max_endpointing_delay. Please give me more details about these issues.

tall-airline-16756

08/20/2025, 3:59 PM

The parameters enable_word_time_offsets and enable_word_confidence are set up right in the Google STT constructor, while min_endpointing_delay and max_endpointing_delay get configured at the AgentSession level. Here’s a quick rundown on how you can set them up: 1. Google STT Configuration: • `enable_word_time_offsets`: Turns on word time offsets (default: True). • `enable_word_confidence`: Turns on word confidence scores (default: False). Example:

Copy code

from livekit.plugins import google

stt = google.STT(
    enable_word_time_offsets=True,
    enable_word_confidence=True,
    languages="en-US",
    model="latest_long"
)

2. Endpointing Delay Configuration: • `min_endpointing_delay`: The minimum wait time after detecting a turn end (default: 0.4s). • `max_endpointing_delay`: The maximum wait time for user input after a turn boundary (default: 6.0s). Example:

Copy code

from livekit.agents import AgentSession

session = AgentSession(
    stt=stt,  # Your configured Google STT
    min_endpointing_delay=0.5,  # Feel free to tweak this
    max_endpointing_delay=5.0   # Adjust as you see fit
)

If you have any questions or want me to walk you through anything, just let me know! I’m here to help. Sources: livekit-agents/livekit/agents/voice/audio_recognition.py | Turn detection and interruptions | LiveKit Docs | How does end-of-utterance detection work in conversations?

square-jewelry-14040

08/20/2025, 4:02 PM

in google.STT I cannot seems to find these paramters. This is how it is actually defined in the python sdk:

Copy code

class STT(stt.STT):
    def __init__(
        self,
        *,
        languages: LanguageCode = "en-US",  # Google STT can accept multiple languages
        detect_language: bool = True,
        interim_results: bool = True,
        punctuate: bool = True,
        spoken_punctuation: bool = False,
        model: SpeechModels | str = "latest_long",
        location: str = "global",
        sample_rate: int = 16000,
        min_confidence_threshold: float = _default_min_confidence,
        credentials_info: NotGivenOr[dict] = NOT_GIVEN,
        credentials_file: NotGivenOr[str] = NOT_GIVEN,
        keywords: NotGivenOr[list[tuple[str, float]]] = NOT_GIVEN,
        use_streaming: NotGivenOr[bool] = NOT_GIVEN,
    ):
        """
        Create a new instance of Google STT.

        Credentials must be provided, either by using the ``credentials_info`` dict, or reading
        from the file specified in ``credentials_file`` or via Application Default Credentials as
        described in <https://cloud.google.com/docs/authentication/application-default-credentials>

        args:
            languages(LanguageCode): list of language codes to recognize (default: "en-US")
            detect_language(bool): whether to detect the language of the audio (default: True)
            interim_results(bool): whether to return interim results (default: True)
            punctuate(bool): whether to punctuate the audio (default: True)
            spoken_punctuation(bool): whether to use spoken punctuation (default: False)
            model(SpeechModels): the model to use for recognition default: "latest_long"
            location(str): the location to use for recognition default: "global"
            sample_rate(int): the sample rate of the audio default: 16000
            min_confidence_threshold(float): minimum confidence threshold for recognition
            (default: 0.65)
            credentials_info(dict): the credentials info to use for recognition (default: None)
            credentials_file(str): the credentials file to use for recognition (default: None)
            keywords(List[tuple[str, float]]): list of keywords to recognize (default: None)
            use_streaming(bool): whether to use streaming for recognition (default: True)
        """

tall-airline-16756

08/20/2025, 4:03 PM

I don't have the answer you're looking for. You could also try asking your question: • in one of the other Slack channels or • to https://deepwiki.com/livekit/livekit_composite which is trained on all LiveKit source code If you find the answer, please post it here to help others!

Open in Slack

Previous Next