# ask-ai
b
Is there any way to retrieve the latency metrics per transcript segment, and to have them aggregated and organized?
t
Thanks for reaching out, Sebastian. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
I'm happy to help you retrieve latency metrics per transcript segment by using the metrics events emitted during agent sessions. Here's a quick overview:
• `TTSMetrics`: includes a `segment_id` field that uniquely identifies audio segments, along with useful metrics like `ttfb` (time to first byte) and `duration`.
• `STTMetrics`: provides `request_id` and `speech_id` to help correlate transcription results with user utterances.
• `EOUMetrics`: captures delays such as `transcription_delay` and `end_of_utterance_delay`, which are helpful for understanding timing.
You can subscribe to the `metrics_collected` event to log and gather these metrics. For example:
```python
from livekit.agents.metrics import AgentMetrics, TTSMetrics

@agent.on("metrics_collected")
def on_metrics_collected(metrics: AgentMetrics):
    if isinstance(metrics, TTSMetrics) and metrics.segment_id:
        logger.info(f"Segment {metrics.segment_id}: TTFB = {metrics.ttfb}s, Duration = {metrics.duration}s")
```
To get a full picture across a session, you might want to use the `UsageCollector` class or summarize the data manually (see the sketch below). For end-to-end latency, combining metrics like `end_of_utterance_delay`, `ttft` (LLM), and `ttfb` (TTS) can be very insightful. If you'd like, I'm here to walk you through any part of this or help with the implementation. Just let me know!
Sources: livekit-agents/livekit/agents/voice/audio_recognition.py | Logs, metrics, and telemetry | LiveKit Docs | Capturing metrics | LiveKit Docs | Capturing metrics | LiveKit Docs | Text and transcriptions | LiveKit Docs
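For the session-wide roll-up, here's a minimal sketch of that pattern, assuming a v1.x `AgentSession` available as `session`:
```python
from livekit.agents import MetricsCollectedEvent, metrics

usage_collector = metrics.UsageCollector()

@session.on("metrics_collected")
def _on_metrics_collected(ev: MetricsCollectedEvent):
    # log each metrics event and fold it into the session-wide aggregate
    metrics.log_metrics(ev.metrics)
    usage_collector.collect(ev.metrics)

async def log_usage():
    # e.g. register via ctx.add_shutdown_callback(log_usage) to print the summary at shutdown
    summary = usage_collector.get_summary()
    print(f"Usage summary: {summary}")
```
Note that `UsageCollector` summarizes usage (tokens, audio duration, API calls) rather than per-segment latency, so keep the per-event handler above if you need segment-level numbers.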
b
Does the `UsageCollector` let me see the latency metrics per turn the agent spoke?
t
I understand you're looking to view latency metrics per agent turn. The `UsageCollector` is actually designed to aggregate usage data like token counts and API usage across sessions, so it doesn't support per-turn latency metrics directly. But don't worry! You can still capture latency for each turn by subscribing to the `metrics_collected` event and using the `speech_id` field to correlate related metrics. This way, you can group together metrics like `LLMMetrics`, `TTSMetrics`, and `EOUMetrics` for every turn. To get the total latency per turn, simply add up these values (a minimal sketch follows below):
• `end_of_utterance_delay` (EOUMetrics)
• `ttft` (time to first token, LLMMetrics)
• `ttfb` (time to first byte, TTSMetrics)
If you'd like, I'm here to help with more detailed guidance or examples on how to implement this, just let me know!
Sources: examples/src/cartersia_tts.ts | Capturing metrics | LiveKit Docs | Capturing metrics | LiveKit Docs | Logs, metrics, and telemetry | LiveKit Docs | tests/test_agent_session.py
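For instance, a minimal sketch, assuming the v1.x `AgentSession` events and that each of these metric objects carries a `speech_id`, could collect the three components per turn and log the sum once they have all arrived:
```python
from livekit.agents import MetricsCollectedEvent
from livekit.agents.metrics import EOUMetrics, LLMMetrics, TTSMetrics

# per-turn latency components, keyed by speech_id (helper structure for this sketch)
turn_latency: dict[str, dict[str, float]] = {}

@session.on("metrics_collected")
def _collect_turn_latency(ev: MetricsCollectedEvent):
    m = ev.metrics
    speech_id = getattr(m, "speech_id", None)
    if not speech_id:
        return
    parts = turn_latency.setdefault(speech_id, {})
    if isinstance(m, EOUMetrics):
        parts["eou"] = m.end_of_utterance_delay
    elif isinstance(m, LLMMetrics):
        parts["ttft"] = m.ttft
    elif isinstance(m, TTSMetrics):
        parts["ttfb"] = m.ttfb
    if {"eou", "ttft", "ttfb"}.issubset(parts):
        total = parts["eou"] + parts["ttft"] + parts["ttfb"]
        print(f"Turn {speech_id}: total latency = {total:.3f}s")
```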
b
My issue with this is that the metrics are really unorganized, and they don't follow a flow where it's obvious how to aggregate them. You can have 4 speech-to-text events for one LLM call, at least from what I saw; it's very unorganized. How would you approach that?
t
You're absolutely right that having multiple STT events for a single LLM call can make aggregating metrics tricky. There's a way to handle this by organizing and correlating your metrics deliberately. Here's an approach you might find helpful:
1. Group metrics by session and speech ID: Use the `speech_id` field to link related metrics like STT, LLM, and TTS for each user turn, so all the metrics for a specific turn stay grouped together.
2. Track metric types: Keep separate collections for each metric type (STT, LLM, TTS) within the same `speech_id` group. This makes it easier to spot and manage multiple STT events for the same turn.
3. Define aggregation rules: When you have multiple STT events, decide on a clear strategy to aggregate them, whether that's summing them (as the example below does), picking the longest duration, averaging them, or using only the final STT event.
4. Implement a processing pipeline: Process metrics as they arrive into a structured store. Here's a simple example:
```python
from livekit.agents import MetricsCollectedEvent
from livekit.agents.metrics import LLMMetrics, STTMetrics, TTSMetrics

# metrics grouped per user turn, keyed by speech_id
metrics_store: dict[str, dict] = {}

@session.on("metrics_collected")
def handle_metrics(event: MetricsCollectedEvent):
    # read speech_id defensively in case a metric type doesn't carry one
    speech_id = getattr(event.metrics, "speech_id", None)
    if not speech_id:
        return

    if speech_id not in metrics_store:
        metrics_store[speech_id] = {"stt": [], "llm": None, "tts": None}

    if isinstance(event.metrics, STTMetrics):
        metrics_store[speech_id]["stt"].append(event.metrics)
    elif isinstance(event.metrics, LLMMetrics):
        metrics_store[speech_id]["llm"] = event.metrics
    elif isinstance(event.metrics, TTSMetrics):
        metrics_store[speech_id]["tts"] = event.metrics

# aggregate the stored metrics at the end of the session
def aggregate_metrics():
    for speech_id, data in metrics_store.items():
        total_stt_duration = sum(stt.duration for stt in data["stt"])
        llm_ttft = data["llm"].ttft if data["llm"] else 0
        tts_ttfb = data["tts"].ttfb if data["tts"] else 0
        total_latency = total_stt_duration + llm_ttft + tts_ttfb
        print(f"Speech ID {speech_id}: Total Latency = {total_latency}s")
```
This keeps your metrics organized and aggregated in a clear, systematic way (a sketch for running the aggregation at shutdown follows below). If anything feels unclear or if you want to dive deeper, just let me know, I'm here to help!
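To make sure `aggregate_metrics()` from the sketch above actually runs, one option, assuming you have the `JobContext` available in your entrypoint, is to register it as a shutdown callback:
```python
from livekit.agents import JobContext

async def entrypoint(ctx: JobContext):
    # ... create and start your AgentSession here ...

    async def log_aggregated_metrics():
        # wrap the synchronous helper so the shutdown hook can await it
        aggregate_metrics()

    ctx.add_shutdown_callback(log_aggregated_metrics)
```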
b
Maybe we could retrieve the metrics on the agent end turn. Does the session provide any way to retrieve this?
t
I'm happy to help you retrieve metrics at the start and end of a turn! Here are some useful methods you can use:
• End-of-utterance (EOU) metrics: These are provided when the user finishes speaking and include:
◦ `end_of_utterance_delay`: The time from the end of speech to when the turn completes.
◦ `transcription_delay`: The time between the end of speech and when the final transcript becomes available.
◦ `speech_id`: A unique identifier for the turn, which is great for correlating different events.
• OpenTelemetry tracing: This captures detailed trace-level data for session start, agent turns, and end-of-turn detection. Attributes like `ATTR_START_TIME` and `ATTR_END_TIME` mark the timing boundaries for these operations.
• Word-level timing: If you enable `use_tts_aligned_transcript=True`, you can access `TimedString` objects that give you precise start and end times for each word within a turn (see the sketch below).
These tools make it easier to track and analyze metrics at both the start and end of a turn. If you'd like, I'm here to provide more details or walk you through some examples, just let me know!
Sources: Logs, metrics, and telemetry | LiveKit Docs | livekit-agents/livekit/agents/telemetry/trace_types.py | Text and transcriptions | LiveKit Docs | Text and transcriptions | LiveKit Docs
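For the word-level timing route, here's a rough sketch of reading `TimedString` items from a custom `transcription_node`. The exact import paths for `TimedString` and `ModelSettings` are assumptions on my part, so double-check them against your installed version:
```python
from __future__ import annotations

from typing import AsyncGenerator, AsyncIterable

from livekit.agents import Agent, ModelSettings
from livekit.agents.voice.transcription import TimedString  # import path is an assumption


class TimedTranscriptAgent(Agent):
    async def transcription_node(
        self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
    ) -> AsyncGenerator[str | TimedString, None]:
        async for chunk in text:
            if isinstance(chunk, TimedString):
                # each aligned chunk carries its own start/end time within the turn
                print(f"'{chunk}' [{chunk.start_time:.2f}s - {chunk.end_time:.2f}s]")
            yield chunk
```
This only yields timing information when `use_tts_aligned_transcript=True` is set on the session and the TTS provider supports aligned transcripts.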
b
I meant once the agent turn is done, sorry.
t
To retrieve metrics right after the agent's turn is completed, you can subscribe to the `metrics_collected` event emitted by the `AgentSession`. This event delivers metrics at important points, including once the agent finishes its turn. For instance, the `EOUMetrics` event is triggered when the agent completes processing and includes details like:
• `end_of_utterance_delay`: The time from when the user stops speaking until the turn is completed.
• `transcription_delay`: How long it takes from the end of speech to when the final transcript is ready.
• `on_user_turn_completed_delay`: The duration of the `on_user_turn_completed` callback.
Here's a simple example to illustrate this:
```python
from livekit.agents import metrics, MetricsCollectedEvent

@session.on("metrics_collected")
def on_metrics_collected(event: MetricsCollectedEvent):
    if event.metrics.type == "eou_metrics":
        print(f"EOU Delay: {event.metrics.end_of_utterance_delay}")
        print(f"Transcription Delay: {event.metrics.transcription_delay}")
        print(f"Callback Delay: {event.metrics.on_user_turn_completed_delay}")
```
If you'd like, I'm here to help with any further questions or provide more examples, just let me know!
Sources: Capturing metrics | LiveKit Docs | tests/test_agent_session.py | Logs, metrics, and telemetry | LiveKit Docs
b
I mean, like here, does it have the data? `agent_started_speaking`
t
I'm happy to help you with this! Yes, LiveKit does provide a way to detect when the agent starts speaking. In LiveKit Agents v0.x, you can use the `agent_started_speaking` event like this:
```python
@agent.on("agent_started_speaking")
def agent_started_speaking():
    print("Agent started speaking")
```
For LiveKit Agents v1.x, this event was updated to `agent_state_changed`. You can listen for when the state changes to `"speaking"` like so:
```python
@session.on("agent_state_changed")
def on_agent_state_changed(ev: AgentStateChangedEvent):
    if ev.new_state == "speaking":
        print("Agent started speaking")
```
If anything feels unclear or if you'd like me to walk you through it step-by-step, please don't hesitate to ask. I'm here to help!
Sources: VoicePipelineAgent | LiveKit Docs | VoicePipelineAgent | LiveKit Docs | Working with the MultimodalAgent class | LiveKit Docs | Working with the MultimodalAgent class | LiveKit Docs | Agents 0.x migration guide | LiveKit Docs
b
Cool, but does this have latency metrics?
t
Thanks for your question! The `agent_started_speaking` and `agent_state_changed` events don't include latency metrics. They simply let you know when the agent starts speaking or changes state, but they don't provide detailed timing information. If you're looking for latency metrics, the `metrics_collected` event is the way to go. It gives you detailed measurements like:
• `ttft` (time to first token)
• `ttfb` (time to first byte)
• `end_of_utterance_delay`
• `transcription_delay`
If you'd like, I'm here to help you set up `metrics_collected` so you can capture all these details smoothly. Just let me know!
Sources: VoicePipelineAgent | LiveKit Docs | Events and error handling | LiveKit Docs | livekit-agents/livekit/agents/voice/events.py | Logs, metrics, and telemetry | LiveKit Docs
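If you also want a wall-clock view of how quickly the agent starts speaking after the user stops, one pattern you could try is to timestamp the state transitions yourself. This is my own sketch rather than a built-in API, and it assumes the v1.x `user_state_changed`/`agent_state_changed` events and that their event classes are importable from the package root:
```python
import time
from typing import Optional

from livekit.agents import AgentStateChangedEvent, UserStateChangedEvent

# wall-clock timestamp of when the user last stopped speaking (helper state for this sketch)
_user_stopped_at: Optional[float] = None

@session.on("user_state_changed")
def _on_user_state_changed(ev: UserStateChangedEvent):
    global _user_stopped_at
    if ev.old_state == "speaking" and ev.new_state == "listening":
        _user_stopped_at = time.perf_counter()

@session.on("agent_state_changed")
def _on_agent_state_changed(ev: AgentStateChangedEvent):
    global _user_stopped_at
    if ev.new_state == "speaking" and _user_stopped_at is not None:
        # rough end-to-end response latency as perceived by the user
        print(f"Agent responded after {time.perf_counter() - _user_stopped_at:.2f}s")
        _user_stopped_at = None
```
Cross-checking this wall-clock number against the `end_of_utterance_delay + ttft + ttfb` sum from `metrics_collected` is a good sanity check that your aggregation is wired up correctly.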