# ask-ai
b
I noticed that the JSON transcript does not have timestamps for when the messages were sent. Is there any way to retrieve this?
t
Thanks for reaching out, Sebastian. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
I'm happy to help you with LiveKit JSON transcripts and their timestamps! Here's how you can find them depending on the type of data:
• Chat Messages: The `ChatMessage` object includes a `timestamp` field that tells you exactly when the message was created. This field is available across all SDKs.
• Transcriptions: For speech-to-text transcriptions, you'll find timestamps at both the segment and word levels. Specifically, the `TranscriptionSegment` object has `start_time` and `end_time` fields marking the beginning and end of each transcribed segment (see the sketch below).

If you'd like, I can guide you through accessing these timestamps in your particular SDK or use case. Just let me know!
Sources: livekit-protocol/livekit/protocol/models.pyi | packages/livekit-rtc/src/transcription.ts | lib/src/proto/livekit_models.pbjson.dart | livekit-ffi/src/livekit.proto.rs | Pipeline nodes and hooks | LiveKit Docs
b
Cool, awesome. Is there a way we can know the latency of each step for these messages, divided by STT, LLM, TTS, etc.?
t
LiveKit offers detailed latency metrics for each step in message processing to help you understand performance:
• Speech-to-Text (STT): You'll find metrics like `duration` (the time it takes to create the transcript) and `agentsSttTtft` (the time until the final transcription is ready).
• Large Language Model (LLM): Metrics include `ttft` (time to first token) and `duration` (the total time to generate the full completion).
• Text-to-Speech (TTS): Here, you have `ttfb` (time to first byte of audio) and `duration` (how long the TTS request takes to complete).

You can log these metrics in real time, aggregate them over a session, or send them to external monitoring tools (a sketch follows below). If you'd like a hand with setting this up or accessing these metrics, just let me know!
Sources: Sources/LiveKit/Protos/livekit_metrics.pb.swift | livekit-agents/livekit/agents/metrics/utils.py | agents/src/metrics/utils.ts | Capturing metrics | LiveKit Docs
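As a rough illustration, here is a minimal sketch of splitting those metrics by stage in a Python agent. It assumes an existing `AgentSession` named `session`, and that the `metrics_collected` event delivers an object whose `.metrics` attribute is one of the metric classes shown later in this thread:
```python
from livekit.agents.metrics import LLMMetrics, STTMetrics, TTSMetrics

@session.on("metrics_collected")
def _on_metrics(ev):
    m = ev.metrics
    if isinstance(m, STTMetrics):
        print(f"STT  duration={m.duration:.3f}s audio={m.audio_duration:.3f}s")
    elif isinstance(m, LLMMetrics):
        print(f"LLM  ttft={m.ttft:.3f}s duration={m.duration:.3f}s")
    elif isinstance(m, TTSMetrics):
        print(f"TTS  ttfb={m.ttfb:.3f}s duration={m.duration:.3f}s")
```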
b
Awesome, but could we have this detailed latency metric per transcription message?
t
Absolutely! LiveKit provides detailed latency metrics for each transcription message, which can be really helpful for understanding performance. These metrics cover:
• STT (Speech-to-Text): You'll find `audio_duration`, `duration` (the time taken to create the transcript), and whether it was processed in streaming mode.
• LLM: This includes `ttft` (time to first token), `duration` (the total time for completion), and token usage metrics.
• TTS (Text-to-Speech): Here, you get `ttfb` (time to first byte) and `audioDuration` (duration of the generated audio).

These metrics are sent as events during a session, and you can link them to specific transcription messages using fields like `requestId` and `timestamp` (a small sketch follows below). If you'd like, I'm happy to help you set this up or answer any questions you might have!
Sources: Capturing metrics | LiveKit Docs | livekit-agents/livekit/agents/metrics/utils.py | Logs, metrics, and telemetry | LiveKit Docs | agents/src/metrics/utils.ts | metrics/metrics_tts.py
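For example, a small sketch of keeping metrics indexed by their request ID so a message's metrics can be looked up later. In the Python SDK the field is `request_id`; whether every metric type carries it can vary, so the `getattr` guard and the storage layout are assumptions:
```python
# request_id -> metric object (illustrative in-memory storage)
metrics_by_request: dict[str, object] = {}

def index_metric(m) -> None:
    # STT, LLM, and TTS metric objects expose an ID for the underlying
    # provider request; keep the latest metric seen per ID.
    rid = getattr(m, "request_id", None)
    if rid:
        metrics_by_request[rid] = m
```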
b
Where do I get that request ID?
t
I'm happy to clarify how transcription messages work in LiveKit! For these messages, LiveKit uses a segment ID (`id`) as the unique identifier rather than a request ID. This `id` remains stable across both interim updates and final results, which makes it easier to track and manage transcription segments.

For instance, if you're working with the Swift SDK, you'll find the `id` field right in the `TranscriptionSegment` object when handling transcription events. If you're using the JavaScript SDK, the `id` is available in the `TranscriptionReceived` event. If you'd like, I can guide you through implementing this in your specific SDK. Just let me know!
Sources: Sources/LiveKit/Protos/livekit_models.pb.swift | Transcriptions | LiveKit Docs | livekit-agents/livekit/agents/voice/transcription/_utils.py | livekit-protocol/src/livekit.rs | protobufs/livekit_models.proto
b
Oh, but I wanna save these latencies in the database. I'm asking for this in my Python agent.
t
I'd be happy to help you save transcription messages in your database from a Python agent! A good way to keep things organized is to use the `segment.id` as the unique identifier for each transcription segment. Here's a simple approach you can follow:
1. In your Python agent, enable transcription handling by creating a method that processes each transcription segment.
2. For every `TranscriptionSegment`, gather the `id`, `text`, `start_time`, `end_time`, and `final` fields.
3. Save these details in your database, using the `id` as the primary key. This ensures each entry is unique and can be updated when interim results are replaced by final ones (a sketch of this step follows below).

If you'd like, I can also help you integrate this into your specific database setup. Just let me know!
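Here is a minimal sketch of step 3 using SQLite from the Python standard library. The table name and the way you receive each segment object are assumptions; the columns mirror the fields listed above:
```python
import sqlite3

db = sqlite3.connect("transcripts.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS transcript_segments (
        id TEXT PRIMARY KEY,
        text TEXT,
        start_time INTEGER,
        end_time INTEGER,
        final INTEGER
    )"""
)

def save_segment(seg) -> None:
    # Upsert on the segment id, so an interim result is overwritten by the
    # final result that later arrives with the same id.
    db.execute(
        "INSERT INTO transcript_segments (id, text, start_time, end_time, final) "
        "VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET text=excluded.text, "
        "start_time=excluded.start_time, end_time=excluded.end_time, final=excluded.final",
        (seg.id, seg.text, seg.start_time, seg.end_time, int(seg.final)),
    )
    db.commit()
```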
b
Awesome, I already do this. Now, how can I connect the latency metrics to the transcript message or transcript segment?
t
To connect latency metrics to transcript messages in your Python agent, here's a step-by-step guide:
1. Enable TTS-aligned transcripts: Set `use_tts_aligned_transcript=True` in your `AgentSession`. This gives you access to word-level timestamps through `TimedString` objects.
2. Capture metrics: Use event listeners to gather STT, LLM, and TTS metrics. These include fields like `speech_id` and `request_id`, and latency values such as `ttfb` and `duration`.
3. Correlate metrics: Connect these metrics to transcript segments by matching `speech_id` or `request_id`. For example, you can align `TimedString` timestamps with TTS metrics for detailed analysis.
4. Aggregate data: Use `metrics.UsageCollector` to gather and summarize latency metrics across your sessions (see the sketch below).

If you'd like, I'd be happy to provide a code example or walk you through the implementation. Just let me know!
Sources: Capturing metrics | LiveKit Docs | Text and transcriptions | LiveKit Docs | tests/test_agent_session.py
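For step 4, here is a minimal sketch of session-level aggregation with `UsageCollector`, assuming an existing `AgentSession` named `session`; the `collect()`/`get_summary()` usage follows the LiveKit metrics docs, but treat the wiring here as illustrative:
```python
from livekit.agents import metrics

usage_collector = metrics.UsageCollector()

@session.on("metrics_collected")
def _on_metrics_collected(ev):
    # Feed every metrics event into the collector as it arrives.
    usage_collector.collect(ev.metrics)

def log_session_summary() -> None:
    # Call this when the session ends (e.g., from a shutdown callback).
    summary = usage_collector.get_summary()
    print(f"Session usage summary: {summary}")
```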
b
No, not quite. I mean, I have the transcript segments JSON here:
```python
transcript = session.history.to_dict(exclude_timestamp=False)
```
and the collected metrics:
```python
@session.on("metrics_collected")
def _on_metrics_collected(ev):
    m = getattr(ev, "metrics", ev)
    collected_metrics.append(m)
```
Now, how do I connect the latency to their respective transcript segments? Or is there a better way of doing this?
t
I understand you're looking to connect collected metrics to specific transcript segments in LiveKit. There isn't a direct documented method for this yet, but here's an approach you might find helpful:
1. Track Events: Listen for both `user_input_transcribed` (which gives you the transcript segments) and `metrics_collected` (which provides the metrics).
2. Correlate Identifiers: Use shared identifiers like `request_id` or `speech_id` to link the metrics with the transcript segments. Just make sure these IDs stay consistent across the events.
3. Manual Mapping: Create a mapping in your application logic that ties metrics to transcript segments, either by timing or by these identifiers (a sketch of the timing-based option follows below).

If you'd like, I can help you implement this or explore other options together. Just let me know!
Sources: Logs, metrics, and telemetry | LiveKit Docs | metrics/send-metrics-to-3p/metrics_server/README.md | Text and transcriptions | LiveKit Docs
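Given the data you already have (the `session.history.to_dict(exclude_timestamp=False)` dict and your `collected_metrics` list), here is a rough sketch of option 3, matching by time. The history keys used here (`items`, `created_at`) are assumptions about the dict layout, and each collected metric is assumed to carry a `timestamp`, so inspect your actual data and adjust:
```python
def attach_metrics_to_history(transcript: dict, collected_metrics: list, window_s: float = 10.0):
    """Pair each history item with metric events recorded shortly after it."""
    paired = []
    for item in transcript.get("items", []):   # assumed key; check your dict
        item_ts = item.get("created_at")       # assumed per-item timestamp key
        if item_ts is None:
            continue
        nearby = [
            m for m in collected_metrics
            if getattr(m, "timestamp", None) is not None
            and 0 <= m.timestamp - item_ts <= window_s
        ]
        paired.append({"item": item, "metrics": nearby})
    return paired
```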
b
Cool. And do you recommend doing this, or is it better just to collect the metrics and display them instead?
t
Deciding whether to connect metrics to transcript segments or just collect and display them depends on what you're aiming to achieve:
• Connect Metrics to Segments: If you want detailed insight into how each transcription segment is performing, whether for debugging, optimization, or analytics, linking metrics to segments gives you that level of detail. It's especially helpful when you want to pinpoint specific bottlenecks in your pipeline.
• Collect and Display Metrics: If your main goal is to keep an eye on overall system performance or share high-level numbers like average latency or total processing time, collecting and displaying aggregated metrics is a simpler and more efficient approach.

If you're thinking about real-time monitoring or showing metrics to users, aggregated data might be all you need. For deeper analysis or troubleshooting, connecting metrics to segments can really make a difference. Feel free to reach out if you'd like help deciding what's best for your situation!
b
How do you suggest I calculate latency on each part of the call?
t
I'd be happy to help you calculate latency for each part of a call (STT, LLM, TTS) in LiveKit! Here are the useful metrics:
• STT (Speech-to-Text): `audio_duration` (the duration of the audio input) and `duration` (the time taken for transcription).
• LLM (Large Language Model): `ttft` (time to first token) and `duration` (total time for completion).
• TTS (Text-to-Speech): `ttfb` (time to first byte) and `audio_duration` (duration of the generated audio).
• End-of-Utterance (EOU) Delay: `end_of_utterance_delay` (time from speech end to LLM start) and `transcription_delay` (time to the final transcript).

To estimate the total conversation latency, you can combine these metrics like this:
```python
total_latency = eou.end_of_utterance_delay + llm.ttft + tts.ttfb
```
Don't forget to subscribe to the `metrics_collected` event so you can capture these metrics in real time (a per-turn sketch follows below). If you'd like, I can guide you through the implementation. Just let me know!
Sources: Capturing metrics | LiveKit Docs | Logs, metrics, and telemetry | LiveKit Docs
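Building on that formula, here is a sketch that computes the total per turn by grouping metric events on `speech_id`. It assumes the EOU, LLM, and TTS metric objects for a given turn all expose a `speech_id` field; check the attributes on your installed version before relying on this:
```python
from livekit.agents.metrics import EOUMetrics, LLMMetrics, TTSMetrics

turns: dict[str, dict] = {}  # speech_id -> {"eou": ..., "llm": ..., "tts": ...}

def on_metric(m) -> None:
    sid = getattr(m, "speech_id", None)
    if not sid:
        return
    turn = turns.setdefault(sid, {})
    if isinstance(m, EOUMetrics):
        turn["eou"] = m
    elif isinstance(m, LLMMetrics):
        turn["llm"] = m
    elif isinstance(m, TTSMetrics):
        turn["tts"] = m
    if all(k in turn for k in ("eou", "llm", "tts")):
        total = turn["eou"].end_of_utterance_delay + turn["llm"].ttft + turn["tts"].ttfb
        print(f"turn {sid}: total latency ~= {total:.3f}s")
```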
b
What is the difference between the STT delay and the end of utterance delay?
t
Thank you for your question! Here's the difference between STT delay and end-of-utterance (EOU) delay:
• STT delay (transcription delay): The time it takes the speech-to-text (STT) model to produce the final transcript after you've finished speaking. It covers only the STT processing.
• End-of-utterance (EOU) delay: The entire time from when Voice Activity Detection (VAD) recognizes that you've stopped talking to when the agent starts the next step (like LLM inference). It includes both the endpointing process and the transcription time.

So, put simply, STT delay is part of the EOU delay, which also includes the time needed to confirm you've truly finished speaking (illustrated in the snippet below).
Sources: Capturing metrics | LiveKit Docs | livekit-agents/livekit/agents/voice/audio_recognition.py
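To make that relationship concrete, here is a tiny illustrative snippet in the style of the formulas above, assuming an `EOUMetrics` object named `eou` with the two fields just mentioned:
```python
# The EOU delay contains the transcription (STT) delay, so the remainder is
# roughly the endpointing / turn-detection time.
stt_delay = eou.transcription_delay
endpointing_time = eou.end_of_utterance_delay - eou.transcription_delay
```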
b
are llm.duration and tts.duration the whole duration? like literally the WHOLE duration?
t
Good question about the `llm.duration` and `tts.duration` metrics! They cover the processing time of each stage:
• LLM Duration: How long the LLM took to generate the full response, from when the prompt was sent to the model until the entire completion was ready. Note that this doesn't include any extra time spent on tool calls the LLM might trigger.
• TTS Duration: The total time the TTS model needed to create the full audio output from the text provided. It doesn't cover the time it takes to actually play back the audio.

So these metrics give a clear picture of the processing time for each stage, but they don't capture the full end-to-end conversation latency. Other factors like speech-to-text (STT), end-of-utterance detection, and network delays also play a role.
Sources: Logs, metrics, and telemetry | LiveKit Docs | Capturing metrics | LiveKit Docs
b
is there any example of latency metrics retrieval?
t
Here's an example of how you can retrieve and log latency metrics in a LiveKit Python agent:
```python
from livekit.agents.metrics import LLMMetrics, TTSMetrics, STTMetrics, EOUMetrics

# Handle LLM metrics (e.g., Time to First Token - TTFT)
def on_llm_metrics(metrics: LLMMetrics):
    print(f"LLM TTFT: {metrics.ttft:.4f}s, Duration: {metrics.duration:.4f}s")

# Handle TTS metrics (e.g., Time to First Byte - TTFB)
def on_tts_metrics(metrics: TTSMetrics):
    print(f"TTS TTFB: {metrics.ttfb:.4f}s, Duration: {metrics.duration:.4f}s")

# Handle STT metrics (e.g., transcription duration)
def on_stt_metrics(metrics: STTMetrics):
    print(f"STT Duration: {metrics.duration:.4f}s, Audio Duration: {metrics.audio_duration:.4f}s")

# Handle EOU metrics (e.g., transcription delay)
def on_eou_metrics(metrics: EOUMetrics):
    print(f"Transcription Delay: {metrics.transcription_delay:.4f}s, EOU Delay: {metrics.end_of_utterance_delay:.4f}s")
```
To capture these metrics in real time, subscribe to the `metrics_collected` event. If you're curious about the total conversation latency, you can calculate it like this:
```python
total_latency = eou.end_of_utterance_delay + llm.ttft + tts.ttfb
```
If you'd like help implementing this or forwarding these metrics to an external system, please don't hesitate to reach out. I'm here to help!
Sources: Logs, metrics, and telemetry | LiveKit Docs | metrics/metrics_vad.py
b
I'm confused. I just want the latency metrics by transcript message, any way of doing that?
t
I'm glad you're looking into associating latency metrics with individual transcript messages! A great way to do this is the `transcription_node` method in your Python agent. When you enable `use_tts_aligned_transcript=True` in the `AgentSession`, your transcription output includes timing information through `TimedString` objects. These give you `start_time` and `end_time` for each word or sentence, making it easier to align latency metrics with specific parts of your transcript. Here's a simple example to guide you:
```python
import logging

logger = logging.getLogger("agent")

async def transcription_node(self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings) -> AsyncGenerator[str | TimedString, None]:
    async for chunk in text:
        if isinstance(chunk, TimedString):
            logger.info(f"TimedString: '{chunk}' ({chunk.start_time} - {chunk.end_time})")
            # Associate this chunk with relevant metrics (e.g., TTS or LLM events)
        yield chunk
```
You can also combine this approach with metrics like `STTMetrics`, `LLMMetrics`, `TTSMetrics`, and `EOUMetrics` to build a detailed timeline of your system's performance for each transcript segment (a snippet showing how to enable the session option follows below). If you'd like, I can help you with the next steps or clarify anything else!
Sources: metrics/metrics_llm.py | Text and transcriptions | LiveKit Docs | examples/voice_agents/timed_agent_transcript.py
b
Awesome, but yeah, I'm talking more about the different metrics for every part of the session: STT, LLM, TTS, EOU. For each transcript message, how can we do that?
t
To associate STT, LLM, TTS, and EOU metrics with each transcript message, here's a step-by-step guide:
1. Subscribe to Metrics Events: Attach event listeners to capture metrics for each component (STT, LLM, TTS, and EOU). For example:
```python
stt.on("metrics_collected", lambda e: self.on_stt_metrics(e))
llm.on("metrics_collected", lambda e: self.on_llm_metrics(e))
tts.on("metrics_collected", lambda e: self.on_tts_metrics(e))
stt.on("eou_metrics_collected", lambda e: self.on_eou_metrics(e))
```
2. Correlate Metrics: Use shared identifiers like `speech_id` (found in `EOUMetrics` and `STTMetrics`) or `request_id`, or fall back to timestamps, to link the metrics back to specific transcript messages. For example:
```python
async def on_stt_metrics(self, metrics):
    print(f"Transcript for speech '{metrics.speech_id}': Audio duration = {metrics.audio_duration}s")
```
3. Aggregate Metrics: The `UsageCollector` class can track and summarize metrics across the entire session, and the `log_metrics` helper can log them or forward them to an external monitoring system.
4. Calculate Total Latency: Combine the metrics to compute the end-to-end latency:
```python
total_latency = eou.end_of_utterance_delay + llm.ttft + tts.ttfb
```
This connects detailed latency metrics to each transcript message, making performance analysis much clearer. If you have any questions or need a hand with any part of this, just let me know!
Sources: Logs, metrics, and telemetry | LiveKit Docs | Capturing metrics | LiveKit Docs | complex-agents/turn-taking/agent.py | livekit-agents/livekit/agents/metrics/utils.py
b
But my transcript doesn't have a speech_id, it has an id.
t
I understand, and it is a bit tricky right now: there isn't a direct way to link the `id` in transcript segments with the `speech_id` used in metrics like STT, LLM, TTS, and EOU. These IDs are generated separately and aren't cross-referenced in the system. Here are some workarounds you can try:
1. Use Timing Correlation: Match the timestamps of transcript segments with the metric events (like STT or EOU) from the same session. This often aligns things well.
2. Log Both IDs: Capture the `speech_id` from your metrics and the `id` from transcript events, then connect them during post-processing.
3. Custom Transcription Forwarding: If you're comfortable with customization, you can override the default transcription behavior to include the `speech_id` directly in metadata or logs.
4. Use OpenTelemetry: Tracing attributes like `ATTR_SPEECH_ID` can help align events throughout the agent lifecycle.

If you'd like, I can help you implement any of these options or discuss your specific use case in more detail. Just let me know, and we'll work through it together!
Sources: Logs, metrics, and telemetry | LiveKit Docs | pipeline-stt/diarization.py | livekit-agents/livekit/agents/voice/transcription/__init__.py | agents/src/metrics/base.ts