# ask-ai
m
I want to process the transcript coming from STT before passing it to the LLM and then pass the processed transcript. I was thinking that I should do this in stt_node(). How can I do this?
t
Thanks for reaching out, Mridul. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
To process a transcript in the `stt_node` before passing it to the LLM in LiveKit, you can override the `stt_node` method in your custom Agent class. Here’s an example:
```python
from livekit import rtc
from livekit.agents import ModelSettings, stt, Agent
from typing import AsyncIterable, Optional

async def stt_node(
    self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
) -> Optional[AsyncIterable[stt.SpeechEvent]]:
    async def process_speech_events():
        async for event in Agent.default.stt_node(self, audio, model_settings):
            # Process the transcript here before passing to LLM
            # For example, remove filler words, fix formatting, etc.
            if event.transcript:
                event.transcript = event.transcript.replace("um", "").replace("uh", "")
                # Add any other text processing you need
            yield event

    return process_speech_events()
```
This example demonstrates how to modify the transcript by removing filler words like "um" and "uh" before the text is sent to the LLM. You can implement any additional text processing logic you need in this node. Sources: Voice pipeline nodes | LiveKit Docs
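For context, `stt_node` is meant to be overridden as a method of a custom `Agent` subclass rather than defined as a free function. A minimal sketch of that wiring follows; the class name and the `instructions` string are illustrative assumptions, not part of the docs snippet above:

```python
from livekit import rtc
from livekit.agents import Agent, ModelSettings, stt
from typing import AsyncIterable, Optional

class FillerFilteringAgent(Agent):
    """Illustrative custom agent; the name and instructions are assumptions."""

    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice assistant.")

    async def stt_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ) -> Optional[AsyncIterable[stt.SpeechEvent]]:
        async def process_speech_events():
            # Delegate to the default pipeline, then post-process each event,
            # exactly as in the snippet above
            async for event in Agent.default.stt_node(self, audio, model_settings):
                yield event

        return process_speech_events()
```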
m
So basically I have implemented a function which takes in the stt transcripts and returns the processed transcript. I want the processed transcript to be passed in LLM. How can I do this?
t
To pass your processed transcript to the LLM, you can override the `transcription_node` in your custom Agent class. This allows you to modify or clean up the transcription text before it is sent to the LLM. Here’s an example:
```python
from livekit.agents import ModelSettings
from typing import AsyncIterable

async def transcription_node(self, text: AsyncIterable[str], model_settings: ModelSettings) -> AsyncIterable[str]:
    async for delta in text:
        # Process the transcript here
        processed_text = your_processing_function(delta)
        yield processed_text
```
In this example, replace `your_processing_function` with the function you’ve implemented to process the STT transcripts. This ensures that the processed transcript is passed to the LLM. Sources: Voice pipeline nodes | LiveKit Docs
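`your_processing_function` is just a placeholder for whatever you have written. As a minimal, self-contained illustration (the filler-word list and whitespace handling here are assumptions, not anything prescribed by LiveKit), it could look like:

```python
import re

def your_processing_function(transcript: str) -> str:
    """Illustrative cleanup: drop common filler words and tidy whitespace."""
    fillers = {"um", "uh", "like"}
    # Keep only words that are not standalone fillers (case-insensitive,
    # ignoring trailing punctuation when comparing)
    words = [w for w in transcript.split() if w.lower().strip(",.") not in fillers]
    # Collapse any repeated spaces left behind
    return re.sub(r"\s+", " ", " ".join(words)).strip()

print(your_processing_function("Um, I want to, uh, book a flight"))
# → I want to, book a flight
```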
m
What is delta? Is it the incomplete transcript? I want to pass the final transcript to the processing_function
t
In the context of the `transcription_node`, the term "delta" refers to the incremental text segments that are being processed through the transcription pipeline. These segments may be incomplete transcriptions as they are streamed in real time. If you want to process only the final transcript, you can modify your implementation to collect all the segments and process them once the transcription is complete. Here’s an example:
```python
from livekit.agents import ModelSettings
from typing import AsyncIterable

async def transcription_node(self, text: AsyncIterable[str], model_settings: ModelSettings) -> AsyncIterable[str]:
    final_transcript = ""
    async for delta in text:
        final_transcript += delta

    # Process the final transcript
    processed_text = your_processing_function(final_transcript)
    yield processed_text
```
In this example, the `final_transcript` variable accumulates all the text segments, and once the transcription is complete, it is passed to your processing function. This ensures that only the final transcript is processed. Sources: Voice pipeline nodes | LiveKit Docs
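The delta-versus-final distinction can be seen with plain strings, independent of LiveKit. The chunks below are made up purely for illustration:

```python
import asyncio
from typing import AsyncIterable

async def fake_deltas() -> AsyncIterable[str]:
    # Simulated streaming chunks, standing in for the `text` stream
    for chunk in ["Hel", "lo ", "wor", "ld"]:
        yield chunk

async def collect_final(text: AsyncIterable[str]) -> str:
    final_transcript = ""
    async for delta in text:   # each delta is only a partial fragment
        final_transcript += delta
    return final_transcript    # only now is the transcript complete

final = asyncio.run(collect_final(fake_deltas()))
print(final)  # → Hello world
```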
m
I noticed that `transcription_node` only captures what comes out of the LLM. I want to control what goes into the LLM (basically from the STT)
t
To control what goes into the LLM from the STT, you can override the `stt_node` in your custom Agent class. This allows you to process the STT output before it is sent to the LLM. Here’s an example:
```python
from livekit import rtc
from livekit.agents import ModelSettings, stt, Agent
from typing import AsyncIterable, Optional

async def stt_node(
    self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
) -> Optional[AsyncIterable[stt.SpeechEvent]]:
    async def process_speech_events():
        async for event in Agent.default.stt_node(self, audio, model_settings):
            # Process the transcript here before passing to LLM
            if event.transcript:
                event.transcript = your_processing_function(event.transcript)
            yield event

    return process_speech_events()
```
In this example, replace `your_processing_function` with the function you’ve implemented to process the STT transcripts. This ensures that the processed transcript is passed to the LLM. Additionally, you can use manual turn control to explicitly manage when user input is sent to the LLM. Methods like `session.interrupt()`, `session.clear_user_turn()`, and `session.commit_user_turn()` can help you control the flow of input. Sources: Speech-to-text (STT) integrations | LiveKit Docs; Turn detection and interruptions | LiveKit Docs
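As a rough sketch of the manual turn-control idea (assuming `session` is your running `AgentSession` and the session is configured for manual turn detection; this fragment is illustrative, not a complete program):

```python
# Hedged sketch: explicitly controlling when the user's turn reaches the LLM.
# `session` is assumed to be an existing AgentSession.

# Stop any in-progress agent speech
session.interrupt()

# Discard the buffered user input without sending it to the LLM...
session.clear_user_turn()

# ...or commit the buffered user input, triggering the LLM response
session.commit_user_turn()
```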
m
It gives me this error: AttributeError: 'SpeechEvent' object has no attribute 'transcript'
t
The error occurs because the `SpeechEvent` object does not have a direct `transcript` attribute. Instead, the transcript information is contained within the `alternatives` array of the `SpeechEvent` object. You can access the transcript using the following approach:
```python
from livekit import rtc
from livekit.agents import ModelSettings, stt, Agent
from typing import AsyncIterable, Optional

async def stt_node(
    self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
) -> Optional[AsyncIterable[stt.SpeechEvent]]:
    async def process_speech_events():
        async for event in Agent.default.stt_node(self, audio, model_settings):
            # Access the transcript from the alternatives array
            if event.alternatives and event.alternatives[0].text:
                processed_text = your_processing_function(event.alternatives[0].text)
                event.alternatives[0].text = processed_text
            yield event

    return process_speech_events()
```
In this example, the transcript is accessed via `event.alternatives[0].text`, and you can apply your processing function to modify it before passing it to the LLM. Sources: livekit.agents.pipeline.pipeline_agent API documentation | SpeechEvent | LiveKit Agents
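The shape of the event can be mimicked with plain dataclasses to see why the attribute lookup changes. This mock is purely an illustration, not the real LiveKit types:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MockSpeechData:
    text: str = ""

@dataclass
class MockSpeechEvent:
    # The transcript lives inside `alternatives`, not on the event itself
    alternatives: List[MockSpeechData] = field(default_factory=list)

event = MockSpeechEvent(alternatives=[MockSpeechData(text="hello um world")])
# event.transcript           # would raise AttributeError, as in the question
cleaned = event.alternatives[0].text.replace("um ", "")
event.alternatives[0].text = cleaned
print(event.alternatives[0].text)  # → hello world
```

In practice you may also want to touch only final results, e.g. by checking the event's type against `stt.SpeechEventType.FINAL_TRANSCRIPT` rather than processing interim events.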