# ask-ai
**Anne:**
How can I get word level transcript data from non-agent participants?
**AI assistant:**
Thanks for reaching out, Anne. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
To obtain word-level transcript data from non-agent participants in LiveKit, you can use the transcription functionality provided by LiveKit's client-side SDKs. Here's how:

- **Set up transcription handling:** Use the LiveKit client SDK in your application to listen for transcription events, so your frontend receives transcription data as it becomes available.
- **Listen for transcription events:** Register a listener for `RoomEvent.TranscriptionReceived`. This event fires whenever a transcription update is received and carries a list of `TranscriptionSegment` objects containing the transcribed text, timestamps, and whether the segment is final.
- **Extract word-level data:** Each `TranscriptionSegment` represents a portion of the transcribed text. Depending on the STT service configuration, segments may correspond to individual words, phrases, or sentences. To get word-level data, you may need to split the text into words and associate each word with a timestamp.
- **Handle interim and final transcripts:** Interim transcriptions (`SpeechEventType.INTERIM_TRANSCRIPT`) are sent while the user speaks, providing near-real-time updates. Final transcriptions (`SpeechEventType.FINAL_TRANSCRIPT`) are sent once the STT service has finalized a segment of speech. Use both to build a complete transcript.
- **Process and store the data:** Store each word along with its start and end times. This allows for precise analysis or playback synchronization if needed.

Let me know if you need further assistance with implementing this!
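As an illustration of the splitting step above, here is a minimal, SDK-independent sketch that takes a finalized segment's text plus its start/end times and approximates per-word timestamps by even interpolation. This is a rough heuristic of my own, not a LiveKit API; if your STT provider returns true word timestamps, prefer those.

```python
def words_with_times(text: str, start: float, end: float) -> list[tuple[str, float, float]]:
    """Split a transcript segment into words and evenly interpolate timestamps.

    Approximation only: real word boundaries are not evenly spaced, so use
    provider-supplied word timestamps when they are available.
    """
    words = text.split()
    if not words:
        return []
    step = (end - start) / len(words)
    return [
        (word, start + i * step, start + (i + 1) * step)
        for i, word in enumerate(words)
    ]


# Example: a 2-second finalized segment containing four words
timed = words_with_times("hello from the meeting", 10.0, 12.0)
# each entry is (word, approx_start, approx_end)
```

You would call something like this from your `TranscriptionReceived` handler for each segment whose `final` flag is set.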
**Anne:**
is there any way to achieve this from agent code?
**AI assistant:**
Absolutely! You can get word-level transcript data from agent code in LiveKit by adding the `transcription_node` method to your agent. Here's a walkthrough:

- **Enable TTS-aligned transcription:** When you start your `AgentSession`, set `use_tts_aligned_transcript=True` so the transcription timing matches the synthesized speech.
- **Implement the `transcription_node` method:** Define `async def transcription_node(...)` in your agent class. This lets you handle each `TimedString` chunk, which carries a `start_time` and `end_time` for every word.
- **Log or use word-level data:** Inside `transcription_node`, check whether a chunk is a `TimedString` and read its timing info to log or process however you like.
- **Pick the right TTS provider:** Providers such as Cartesia and ElevenLabs support word-level timestamps, so they're good choices if you want precise sync between audio and transcription.

Here's a quick example:
```python
async def transcription_node(
    self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
) -> AsyncGenerator[str | TimedString, None]:
    # assumes logger, TimedString, ModelSettings, and the typing imports
    # are available in your agent module
    async for chunk in text:
        if isinstance(chunk, TimedString):
            logger.info(
                f"TimedString: '{chunk.text}' (Start: {chunk.start_time}, End: {chunk.end_time})"
            )
        yield chunk
```
Just a heads-up: word-level timing depends on your TTS provider; Cartesia and ElevenLabs are solid choices for this feature. If you have any questions or want a hand setting this up, feel free to ask!
**Anne:**
this works only for agent participants, correct?
**AI assistant:**
That's correct. The `transcription_node` method applies only to agent participants. It is part of the agent's pipeline that lets the agent adjust or fine-tune its own transcriptions before sharing them with clients.

Sources: examples/voice_agents/structured_output.py | Text and transcriptions | LiveKit Docs | Transcription Applications | livekit/livekit_composite | DeepWiki
**Anne:**
How can I get word-level transcript data from non-agent participants from the agent code? I can access `session.history` from there.
**AI assistant:**
If you want to get transcript data from non-agent participants using `session.history` in your agent code, here's a walkthrough:

- **Access conversation history:** Use `session.history` to pull the full conversation history, which includes messages from non-agent participants. Note that history items carry message text, not per-word timing. For example:

```python
# Access the conversation history (a ChatContext)
history = session.history

# Walk the history items and pick out messages from non-agent participants
# (attribute names may vary slightly across livekit-agents versions)
for item in history.items:
    if item.type == "message" and item.role == "user":
        transcript_text = item.text_content
        # process the transcript text however you need
```
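As a self-contained illustration of the filtering idea, here is the same logic over plain `(role, text)` pairs rather than real history items (whose attributes differ by SDK version, so this helper is purely hypothetical):

```python
def user_messages(items: list[tuple[str, str]]) -> list[str]:
    """Return the text of all non-agent ("user") messages from (role, text) pairs."""
    return [text for role, text in items if role == "user"]


history_pairs = [
    ("user", "hello"),
    ("assistant", "hi, how can I help?"),
    ("user", "what's the weather?"),
]
# user_messages(history_pairs) -> ["hello", "what's the weather?"]
```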
- **Listen for transcription events:** You can also catch transcript data as it arrives by using the `user_input_transcribed` event (handlers registered with `session.on` should be synchronous functions):

```python
@session.on("user_input_transcribed")
def on_transcript(event):
    if event.is_final:
        # this is the final version of this stretch of user speech
        print(f"User transcript: {event.transcript}")
```
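To show how interim and final updates fit together, here is a small, SDK-independent accumulator that a handler like the one above could feed. The class and method names are illustrative, not part of the LiveKit API:

```python
class TranscriptAccumulator:
    """Keeps committed final segments plus the latest interim update."""

    def __init__(self) -> None:
        self.final_segments: list[str] = []
        self.interim: str = ""

    def on_transcript(self, text: str, is_final: bool) -> None:
        if is_final:
            # commit the finalized segment and clear the interim buffer
            self.final_segments.append(text)
            self.interim = ""
        else:
            # interim updates replace one another until finalized
            self.interim = text

    def current_text(self) -> str:
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)


acc = TranscriptAccumulator()
acc.on_transcript("hel", is_final=False)
acc.on_transcript("hello there", is_final=True)
acc.on_transcript("how", is_final=False)
# acc.current_text() -> "hello there how"
```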
- **Enable TTS-aligned transcriptions:** If you want word-level timing for the agent's own speech, enable TTS-aligned transcriptions in your `AgentSession` setup:

```python
session = AgentSession(
    # ... other configurations ...
    use_tts_aligned_transcript=True,
)
```

Keep in mind that this aligns the agent's TTS output; word-level timing for non-agent (user) speech depends on whether your STT provider returns word timestamps.
If anything's unclear or you want a hand with the implementation, just let me know!