# ask-ai
b
Can I override the STT node and send the audio data to external services like Azure pronunciation assessments?
t
Thanks for reaching out, Mohammed. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Absolutely! LiveKit does let you override the STT node to send audio data to external services, including Azure Pronunciation Assessment. You can do this by customizing the `stt_node()` method within a custom `Agent` subclass. This approach allows you to capture and handle audio frames, send them to external services, and process the results if you wish. Here's a simple way to get started:

1. Override the `stt_node()` method in your `Agent` subclass.
2. Extract the audio stream and forward it to the Azure Pronunciation Assessment API.
3. Optionally, process the results, such as fluency or accuracy scores.
4. Continue with transcription using the default or your custom STT logic.

If you'd like, I'd be happy to provide a code example showing exactly how to override `stt_node()` and send audio data to an external service like Azure (see the sketch below). Just let me know!

Sources: Pipeline nodes and hooks | LiveKit Docs | Azure Speech STT integration guide | LiveKit Docs
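For reference, a minimal sketch of such an override, assuming livekit-agents v1.x where `Agent.default.stt_node()` provides the default pipeline behavior (the `PronunciationAgent` name and the forwarding hook are illustrative, not part of LiveKit's API):

```python
from typing import AsyncIterable, Optional

from livekit import rtc
from livekit.agents import Agent, ModelSettings, stt


class PronunciationAgent(Agent):
    async def stt_node(
        self,
        audio: AsyncIterable[rtc.AudioFrame],
        model_settings: ModelSettings,
    ) -> Optional[AsyncIterable[stt.SpeechEvent]]:
        async def tee_audio():
            async for frame in audio:
                # Forward a copy of each frame to the external assessor here
                # (e.g. buffer it and push it to Azure), then pass it along
                yield frame

        # Continue with the default STT pipeline so transcription still works
        async for event in Agent.default.stt_node(self, tee_audio(), model_settings):
            yield event
```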
b
"Extract the audio stream and forward it to the Azure Pronunciation Assessment API" - how can I do this?
t
I'm happy to guide you through extracting the audio stream and forwarding it to the Azure Pronunciation Assessment API. Here's a simple way to approach it:

1. Extract the audio stream: You can use `rtc.AudioStream` to access raw audio frames from a participant's track in real time. For example:

```python
from livekit import rtc


async def forward_track_audio(track: rtc.Track):
    audio_stream = rtc.AudioStream(track)
    async for frame_event in audio_stream:
        frame = frame_event.frame  # rtc.AudioFrame
        audio_data = frame.data.tobytes()  # raw PCM bytes
        # Forward audio_data to Azure here
```

Just make sure the audio format matches Azure's requirements (16 kHz, 16-bit PCM, mono) for the best results.
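If the track isn't already 16 kHz mono, one option is to let the SDK convert it when creating the stream. This is a sketch that assumes a recent LiveKit Python SDK where `rtc.AudioStream.from_track()` accepts `sample_rate` and `num_channels`; if your SDK version doesn't support that, you'd need to resample the frames yourself:

```python
from livekit import rtc


async def forward_track_audio_16k(track: rtc.Track):
    # Ask the SDK for frames already converted to 16 kHz, 16-bit, mono PCM
    # (assumes from_track() supports sample_rate/num_channels in your SDK version)
    audio_stream = rtc.AudioStream.from_track(
        track=track, sample_rate=16000, num_channels=1
    )
    async for frame_event in audio_stream:
        audio_data = frame_event.frame.data.tobytes()
        # Forward audio_data to Azure here
```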
2. Forward the audio to Azure: Using the Azure Speech SDK, you can send the audio data for pronunciation assessment. Here's an example to get you started:

```python
import azure.cognitiveservices.speech as speechsdk


def setup_pronunciation_assessment(speech_key, speech_region, reference_text):
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region)

    # Feed raw PCM audio (16 kHz, 16-bit, mono) through a push stream instead of a microphone
    stream_format = speechsdk.audio.AudioStreamFormat(samples_per_second=16000, bits_per_sample=16, channels=1)
    push_stream = speechsdk.audio.PushAudioInputStream(stream_format)
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)

    # Enable pronunciation assessment against the expected reference text
    pronunciation_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme
    )

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    pronunciation_config.apply_to(speech_recognizer)

    # Print the full assessment JSON for each recognized utterance
    speech_recognizer.recognized.connect(
        lambda evt: print("Assessment:", evt.result.properties.get(speechsdk.PropertyId.SpeechServiceResponse_JsonResult))
    )

    # The caller writes audio with push_stream.write(...) after calling start_continuous_recognition()
    return speech_recognizer, push_stream
```
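If you'd rather work with typed scores than raw JSON (e.g. the fluency or accuracy scores mentioned earlier), the Speech SDK also provides a result wrapper. A small sketch that could replace the `print` callback above (`on_recognized` is just an illustrative name):

```python
import azure.cognitiveservices.speech as speechsdk


def on_recognized(evt: speechsdk.SpeechRecognitionEventArgs):
    # Parse per-utterance scores instead of dumping the raw JSON
    result = speechsdk.PronunciationAssessmentResult(evt.result)
    print(
        f"accuracy={result.accuracy_score} "
        f"fluency={result.fluency_score} "
        f"completeness={result.completeness_score} "
        f"overall={result.pronunciation_score}"
    )

# speech_recognizer.recognized.connect(on_recognized)
```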
With this setup, you'll be able to forward the extracted audio data to Azure's Pronunciation Assessment API using the Speech SDK: write the audio frames to the push stream as they arrive, or buffer them into larger chunks, making sure the data matches the 16 kHz, 16-bit, mono format configured above. If you'd like, I'd be more than happy to provide a complete example of how to integrate this into a LiveKit agent; a rough sketch of the glue code follows below. Just let me know!

Sources: Azure Speech STT integration guide | LiveKit Docs | livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/stt.py | main.py
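For reference, a rough sketch of how the two pieces could be wired together, assuming the `setup_pronunciation_assessment()` variant above that returns `(speech_recognizer, push_stream)`, plus the `from_track()` resampling shown earlier; `assess_participant_audio()` is a hypothetical helper name:

```python
from livekit import rtc


async def assess_participant_audio(track: rtc.Track, speech_key: str,
                                    speech_region: str, reference_text: str):
    # Hypothetical glue: push LiveKit audio frames into the Azure recognizer
    recognizer, push_stream = setup_pronunciation_assessment(
        speech_key, speech_region, reference_text
    )
    # Note: these Speech SDK calls are blocking; consider offloading them
    # to a thread in production code
    recognizer.start_continuous_recognition()

    try:
        audio_stream = rtc.AudioStream.from_track(
            track=track, sample_rate=16000, num_channels=1
        )
        async for frame_event in audio_stream:
            push_stream.write(frame_event.frame.data.tobytes())
    finally:
        push_stream.close()
        recognizer.stop_continuous_recognition()
```

Assessment results will then arrive through the `recognized` callback registered in `setup_pronunciation_assessment()` as each utterance is processed.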