I am using google STT and in the metrics I get ver...
# ask-ai
s
I am using google STT and in the metrics I get very high delays in End OF Utterance: EOU Delay: 1.2413s, Transcription Delay: 1.0779s Can you please help me by offering suggestions I can try to make it better? The STT model is being isntantiated as follows:
Copy code
google.STT(
                model=STTModels.GOOGLE_LATEST_LONG.value,
                languages=["en-US"],
                credentials_info=json.loads(settings.gcp_ai_services_credentials),
                use_streaming=True,
                # punctuate=False,
            )
t
Thanks for reaching out, Rida. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
To help cut down on the End of Utterance (EOU) and Transcription delays when using Google STT, here are some friendly tips you can try: 1. Adjust Endpointing Parameters: • Try lowering
min_endpointing_delay
to 0.2s (it’s usually 0.5s by default). • Bring down
max_endpointing_delay
to 3.0s (instead of the default 6.0s). 2. Optimize Turn Detection: • Go with
turn_detection='vad'
for quicker detection compared to
'stt'
. 3. Configure Google STT for Lower Latency: • Switch over to the
'latest_short'
model—it’s faster! • Turn on
interim_results=True
to get feedback sooner. • Lower the
min_confidence_threshold
to 0.5 (default is 0.65) to speed things up. • Disable features you don’t need like
enable_word_time_offsets
and
enable_word_confidence
to keep things lean. 4. Additional Optimizations: • If you’re using VAD, try reducing
min_silence_duration
to 0.3s to catch silences faster. • Keep an eye on performance metrics to see how these tweaks are working. These little adjustments should help speed things up without losing too much accuracy. If you want to chat more or need a hand with anything else, just let me know! Sources: Capturing metrics | LiveKit Docs | How does end-of-utterance detection work in conversations? | Speech & Audio Processing | livekit/livekit_composite | DeepWiki | Agent Testing & Evaluation | livekit/livekit_composite | DeepWiki | livekit-agents/livekit/agents/voice/audio_recognition.py
s
I cannot find enable_word_time_offsets and enable_word_confidence in google.STT. Also I couldnt find where i can set min_endpointing_delay and max_endpointing_delay. Please give me more details about these issues.
t
The parameters enable_word_time_offsets and enable_word_confidence are set up right in the Google STT constructor, while min_endpointing_delay and max_endpointing_delay get configured at the AgentSession level. Here’s a quick rundown on how you can set them up: 1. Google STT Configuration: • `enable_word_time_offsets`: Turns on word time offsets (default: True). • `enable_word_confidence`: Turns on word confidence scores (default: False). Example:
Copy code
from livekit.plugins import google

stt = google.STT(
    enable_word_time_offsets=True,
    enable_word_confidence=True,
    languages="en-US",
    model="latest_long"
)
2. Endpointing Delay Configuration: • `min_endpointing_delay`: The minimum wait time after detecting a turn end (default: 0.4s). • `max_endpointing_delay`: The maximum wait time for user input after a turn boundary (default: 6.0s). Example:
Copy code
from livekit.agents import AgentSession

session = AgentSession(
    stt=stt,  # Your configured Google STT
    min_endpointing_delay=0.5,  # Feel free to tweak this
    max_endpointing_delay=5.0   # Adjust as you see fit
)
If you have any questions or want me to walk you through anything, just let me know! I’m here to help. Sources: livekit-agents/livekit/agents/voice/audio_recognition.py | Turn detection and interruptions | LiveKit Docs | How does end-of-utterance detection work in conversations?
s
in google.STT I cannot seems to find these paramters. This is how it is actually defined in the python sdk:
Copy code
class STT(stt.STT):
    def __init__(
        self,
        *,
        languages: LanguageCode = "en-US",  # Google STT can accept multiple languages
        detect_language: bool = True,
        interim_results: bool = True,
        punctuate: bool = True,
        spoken_punctuation: bool = False,
        model: SpeechModels | str = "latest_long",
        location: str = "global",
        sample_rate: int = 16000,
        min_confidence_threshold: float = _default_min_confidence,
        credentials_info: NotGivenOr[dict] = NOT_GIVEN,
        credentials_file: NotGivenOr[str] = NOT_GIVEN,
        keywords: NotGivenOr[list[tuple[str, float]]] = NOT_GIVEN,
        use_streaming: NotGivenOr[bool] = NOT_GIVEN,
    ):
        """
        Create a new instance of Google STT.

        Credentials must be provided, either by using the ``credentials_info`` dict, or reading
        from the file specified in ``credentials_file`` or via Application Default Credentials as
        described in <https://cloud.google.com/docs/authentication/application-default-credentials>

        args:
            languages(LanguageCode): list of language codes to recognize (default: "en-US")
            detect_language(bool): whether to detect the language of the audio (default: True)
            interim_results(bool): whether to return interim results (default: True)
            punctuate(bool): whether to punctuate the audio (default: True)
            spoken_punctuation(bool): whether to use spoken punctuation (default: False)
            model(SpeechModels): the model to use for recognition default: "latest_long"
            location(str): the location to use for recognition default: "global"
            sample_rate(int): the sample rate of the audio default: 16000
            min_confidence_threshold(float): minimum confidence threshold for recognition
            (default: 0.65)
            credentials_info(dict): the credentials info to use for recognition (default: None)
            credentials_file(str): the credentials file to use for recognition (default: None)
            keywords(List[tuple[str, float]]): list of keywords to recognize (default: None)
            use_streaming(bool): whether to use streaming for recognition (default: True)
        """
t
I don't have the answer you're looking for. You could also try asking your question: • in one of the other Slack channels or • to https://deepwiki.com/livekit/livekit_composite which is trained on all LiveKit source code If you find the answer, please post it here to help others!