# ask-ai
b
Can I override the STT node and send the audio data to external services like Azure pronunciation assessments?
t
Thanks for reaching out, Mohammed. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Absolutely! LiveKit does let you override the STT node to send audio data to external services, including Azure Pronunciation Assessment. You can do this by customizing the `stt_node()` method within a custom `Agent` subclass. This approach allows you to capture and handle audio frames, send them to external services, and process the results if you wish. Here's a simple way to get started:

1. Override the `stt_node()` method in your `Agent` subclass.
2. Extract the audio stream and forward it to the Azure Pronunciation Assessment API.
3. Optionally, process the results, such as fluency or accuracy scores.
4. Continue with transcription using the default or your custom STT logic.

If you'd like, I'd be happy to provide a code example showing exactly how to override `stt_node()` and send audio data to an external service like Azure (see the sketch below). Just let me know!

Sources: Pipeline nodes and hooks | LiveKit Docs | Azure Speech STT integration guide | LiveKit Docs
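For reference, a minimal sketch of such an override, assuming livekit-agents v1.x where `Agent.default.stt_node()` provides the default pipeline behavior (the `PronunciationAgent` name and the forwarding hook are illustrative, not part of LiveKit's API):

```python
from typing import AsyncIterable, Optional

from livekit import rtc
from livekit.agents import Agent, ModelSettings, stt


class PronunciationAgent(Agent):
    async def stt_node(
        self,
        audio: AsyncIterable[rtc.AudioFrame],
        model_settings: ModelSettings,
    ) -> Optional[AsyncIterable[stt.SpeechEvent]]:
        async def tee_audio():
            async for frame in audio:
                # Forward a copy of each frame to the external assessor here
                # (e.g. buffer it and push it to Azure), then pass it along
                yield frame

        # Continue with the default STT pipeline so transcription still works
        async for event in Agent.default.stt_node(self, tee_audio(), model_settings):
            yield event
```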
b
"Extract the audio stream and forward it to the Azure Pronunciation Assessment API" - how can I do this?
t
I'm happy to guide you through extracting the audio stream and forwarding it to the Azure Pronunciation Assessment API. Here's a simple way to approach it:

1. Extract the audio stream: You can use `rtc.AudioStream` to access raw audio frames from a participant's track in real time. For example:

```python
from livekit import rtc


async def forward_track_audio(track: rtc.Track):
    audio_stream = rtc.AudioStream(track)
    async for frame_event in audio_stream:
        frame = frame_event.frame  # rtc.AudioFrame
        audio_data = frame.data.tobytes()  # raw PCM bytes
        # Forward audio_data to Azure here
```

Just make sure the audio format matches Azure's requirements (16 kHz, 16-bit PCM, mono) for the best results.
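If the track isn't already 16 kHz mono, one option is to let the SDK convert it when creating the stream. This is a sketch that assumes a recent LiveKit Python SDK where `rtc.AudioStream.from_track()` accepts `sample_rate` and `num_channels`; if your SDK version doesn't support that, you'd need to resample the frames yourself:

```python
from livekit import rtc


async def forward_track_audio_16k(track: rtc.Track):
    # Ask the SDK for frames already converted to 16 kHz, 16-bit, mono PCM
    # (assumes from_track() supports sample_rate/num_channels in your SDK version)
    audio_stream = rtc.AudioStream.from_track(
        track=track, sample_rate=16000, num_channels=1
    )
    async for frame_event in audio_stream:
        audio_data = frame_event.frame.data.tobytes()
        # Forward audio_data to Azure here
```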
2. Forward the audio to Azure: Using the Azure Speech SDK, you can send the audio data for pronunciation assessment. Here's an example to get you started:

```python
import azure.cognitiveservices.speech as speechsdk


def setup_pronunciation_assessment(speech_key, speech_region, reference_text):
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region)

    # Feed raw PCM audio (16 kHz, 16-bit, mono) through a push stream instead of a microphone
    stream_format = speechsdk.audio.AudioStreamFormat(samples_per_second=16000, bits_per_sample=16, channels=1)
    push_stream = speechsdk.audio.PushAudioInputStream(stream_format)
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)

    # Enable pronunciation assessment against the expected reference text
    pronunciation_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme
    )

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    pronunciation_config.apply_to(speech_recognizer)

    # Print the full assessment JSON for each recognized utterance
    speech_recognizer.recognized.connect(
        lambda evt: print("Assessment:", evt.result.properties.get(speechsdk.PropertyId.SpeechServiceResponse_JsonResult))
    )

    # The caller writes audio with push_stream.write(...) after calling start_continuous_recognition()
    return speech_recognizer, push_stream
```
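If you'd rather work with typed scores than raw JSON (e.g. the fluency or accuracy scores mentioned earlier), the Speech SDK also provides a result wrapper. A small sketch that could replace the `print` callback above (`on_recognized` is just an illustrative name):

```python
import azure.cognitiveservices.speech as speechsdk


def on_recognized(evt: speechsdk.SpeechRecognitionEventArgs):
    # Parse per-utterance scores instead of dumping the raw JSON
    result = speechsdk.PronunciationAssessmentResult(evt.result)
    print(
        f"accuracy={result.accuracy_score} "
        f"fluency={result.fluency_score} "
        f"completeness={result.completeness_score} "
        f"overall={result.pronunciation_score}"
    )

# speech_recognizer.recognized.connect(on_recognized)
```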
With this setup, you'll be able to forward the extracted audio data to Azure's Pronunciation Assessment API using the Speech SDK: write the audio frames to the push stream as they arrive, or buffer them into larger chunks, making sure the data matches the 16 kHz, 16-bit, mono format configured above. If you'd like, I'd be more than happy to provide a complete example of how to integrate this into a LiveKit agent; a rough sketch of the glue code follows below. Just let me know!

Sources: Azure Speech STT integration guide | LiveKit Docs | livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/stt.py | main.py
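For reference, a rough sketch of how the two pieces could be wired together, assuming the `setup_pronunciation_assessment()` variant above that returns `(speech_recognizer, push_stream)`, plus the `from_track()` resampling shown earlier; `assess_participant_audio()` is a hypothetical helper name:

```python
from livekit import rtc


async def assess_participant_audio(track: rtc.Track, speech_key: str,
                                    speech_region: str, reference_text: str):
    # Hypothetical glue: push LiveKit audio frames into the Azure recognizer
    recognizer, push_stream = setup_pronunciation_assessment(
        speech_key, speech_region, reference_text
    )
    # Note: these Speech SDK calls are blocking; consider offloading them
    # to a thread in production code
    recognizer.start_continuous_recognition()

    try:
        audio_stream = rtc.AudioStream.from_track(
            track=track, sample_rate=16000, num_channels=1
        )
        async for frame_event in audio_stream:
            push_stream.write(frame_event.frame.data.tobytes())
    finally:
        push_stream.close()
        recognizer.stop_continuous_recognition()
```

Assessment results will then arrive through the `recognized` callback registered in `setup_pronunciation_assessment()` as each utterance is processed.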