# ask-ai
**Anne:**
How can I get word level transcript data from non-agent participants?
**AI assistant:**
Thanks for reaching out, Anne. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
To obtain word-level transcript data from non-agent participants in LiveKit, you can use the transcription functionality provided by LiveKit's client-side SDKs. Here's how:

- **Set up transcription handling:** Use the LiveKit client SDK in your application to listen for transcription events, so your frontend receives transcription data as it becomes available.
- **Listen for transcription events:** Register a listener for `RoomEvent.TranscriptionReceived`. This event fires whenever a transcription update is received and carries a list of `TranscriptionSegment` objects containing the transcribed text, timestamps, and whether the segment is final.
- **Extract word-level data:** Each `TranscriptionSegment` represents a portion of the transcribed text. Depending on the STT service configuration, segments may correspond to individual words, phrases, or sentences. To get word-level data, you may need to split the text into words and associate each word with a timestamp.
- **Handle interim and final transcripts:** Interim transcriptions (`SpeechEventType.INTERIM_TRANSCRIPT`) are sent while the user speaks, providing near-real-time updates. Final transcriptions (`SpeechEventType.FINAL_TRANSCRIPT`) are sent once the STT service has finalized a segment of speech. Use both to build a complete transcript.
- **Process and store the data:** Store each word along with its start and end times. This allows for precise analysis or playback synchronization if needed.

Let me know if you need further assistance with implementing this!
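As an illustration of the splitting step above, here is a minimal, SDK-independent sketch that takes a finalized segment's text plus its start/end times and approximates per-word timestamps by even interpolation. This is a rough heuristic of my own, not a LiveKit API; if your STT provider returns true word timestamps, prefer those.

```python
def words_with_times(text: str, start: float, end: float) -> list[tuple[str, float, float]]:
    """Split a transcript segment into words and evenly interpolate timestamps.

    Approximation only: real word boundaries are not evenly spaced, so use
    provider-supplied word timestamps when they are available.
    """
    words = text.split()
    if not words:
        return []
    step = (end - start) / len(words)
    return [
        (word, start + i * step, start + (i + 1) * step)
        for i, word in enumerate(words)
    ]


# Example: a 2-second finalized segment containing four words
timed = words_with_times("hello from the meeting", 10.0, 12.0)
# each entry is (word, approx_start, approx_end)
```

You would call something like this from your `TranscriptionReceived` handler for each segment whose `final` flag is set.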
**Anne:**
is there any way to achieve this from agent code?
**AI assistant:**
Absolutely! You can get word-level transcript data from agent code in LiveKit by adding the `transcription_node` method to your agent. Here's a walkthrough:

- **Enable TTS-aligned transcription:** When you start your `AgentSession`, set `use_tts_aligned_transcript=True` so the transcription timing matches the synthesized speech.
- **Implement the `transcription_node` method:** Define `async def transcription_node(...)` in your agent class. This lets you handle each `TimedString` chunk, which carries a `start_time` and `end_time` for every word.
- **Log or use word-level data:** Inside `transcription_node`, check whether a chunk is a `TimedString` and read its timing info to log or process however you like.
- **Pick the right TTS provider:** Providers such as Cartesia and ElevenLabs support word-level timestamps, so they're good choices if you want precise sync between audio and transcription.

Here's a quick example:
```python
async def transcription_node(
    self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
) -> AsyncGenerator[str | TimedString, None]:
    # assumes logger, TimedString, ModelSettings, and the typing imports
    # are available in your agent module
    async for chunk in text:
        if isinstance(chunk, TimedString):
            logger.info(
                f"TimedString: '{chunk.text}' (Start: {chunk.start_time}, End: {chunk.end_time})"
            )
        yield chunk
```
Just a heads-up: word-level timing depends on your TTS provider; Cartesia and ElevenLabs are solid choices for this feature. If you have any questions or want a hand setting this up, feel free to ask!
**Anne:**
this works only for agent participants, correct?
**AI assistant:**
That's correct. The `transcription_node` method applies only to agent participants. It is part of the agent's pipeline that lets the agent adjust or fine-tune its own transcriptions before sharing them with clients.

Sources: examples/voice_agents/structured_output.py | Text and transcriptions | LiveKit Docs | Transcription Applications | livekit/livekit_composite | DeepWiki
**Anne:**
How can I get word-level transcript data from non-agent participants from the agent code? I can access `session.history` from there.
**AI assistant:**
If you want to get transcript data from non-agent participants using `session.history` in your agent code, here's a walkthrough:

- **Access conversation history:** Use `session.history` to pull the full conversation history, which includes messages from non-agent participants. Note that history items carry message text, not per-word timing. For example:

```python
# Access the conversation history (a ChatContext)
history = session.history

# Walk the history items and pick out messages from non-agent participants
# (attribute names may vary slightly across livekit-agents versions)
for item in history.items:
    if item.type == "message" and item.role == "user":
        transcript_text = item.text_content
        # process the transcript text however you need
```
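As a self-contained illustration of the filtering idea, here is the same logic over plain `(role, text)` pairs rather than real history items (whose attributes differ by SDK version, so this helper is purely hypothetical):

```python
def user_messages(items: list[tuple[str, str]]) -> list[str]:
    """Return the text of all non-agent ("user") messages from (role, text) pairs."""
    return [text for role, text in items if role == "user"]


history_pairs = [
    ("user", "hello"),
    ("assistant", "hi, how can I help?"),
    ("user", "what's the weather?"),
]
# user_messages(history_pairs) -> ["hello", "what's the weather?"]
```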
- **Listen for transcription events:** You can also catch transcript data as it arrives by using the `user_input_transcribed` event (handlers registered with `session.on` should be synchronous functions):

```python
@session.on("user_input_transcribed")
def on_transcript(event):
    if event.is_final:
        # this is the final version of this stretch of user speech
        print(f"User transcript: {event.transcript}")
```
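To show how interim and final updates fit together, here is a small, SDK-independent accumulator that a handler like the one above could feed. The class and method names are illustrative, not part of the LiveKit API:

```python
class TranscriptAccumulator:
    """Keeps committed final segments plus the latest interim update."""

    def __init__(self) -> None:
        self.final_segments: list[str] = []
        self.interim: str = ""

    def on_transcript(self, text: str, is_final: bool) -> None:
        if is_final:
            # commit the finalized segment and clear the interim buffer
            self.final_segments.append(text)
            self.interim = ""
        else:
            # interim updates replace one another until finalized
            self.interim = text

    def current_text(self) -> str:
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)


acc = TranscriptAccumulator()
acc.on_transcript("hel", is_final=False)
acc.on_transcript("hello there", is_final=True)
acc.on_transcript("how", is_final=False)
# acc.current_text() -> "hello there how"
```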
- **Enable TTS-aligned transcriptions:** If you want word-level timing for the agent's own speech, enable TTS-aligned transcriptions in your `AgentSession` setup:

```python
session = AgentSession(
    # ... other configurations ...
    use_tts_aligned_transcript=True,
)
```

Keep in mind that this aligns the agent's TTS output; word-level timing for non-agent (user) speech depends on whether your STT provider returns word timestamps.
If anything's unclear or you want a hand with the implementation, just let me know!