# ask-ai
**User:** When I interrupt a message using text, the text I interrupted gets added after the message I sent in the session history. But it works fine if I interrupt using voice.
**AI:** Thanks for reaching out, Computer. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.

When you interrupt a message using text, the interrupted text is added to the session history as a new message with an "interrupted" status. This ensures that the interruption is explicitly recorded. However, when using voice, the system typically truncates the audio playback and updates the transcription without adding a new entry to the history. This difference in behavior is due to how text and voice interruptions are processed. If you have specific requirements or need further clarification, let me know!

Sources: `livekit-agents/livekit/agents/voice/agent_activity.py`
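If it helps, here is a minimal sketch of how you could inspect that status yourself. It assumes `session` is your active `AgentSession` and that `to_dict()` returns a dict with an `"items"` list of message dicts carrying the `interrupted` flag (the dict shape shown later in this thread):

```python
import json

# Export the history and pretty-print it for inspection.
history = session.history.to_dict()
print(json.dumps(history, indent=2, default=str))

# List the messages that were cut off. The "items" wrapper is an
# assumption about the exported shape; adapt if your version differs.
for item in history.get("items", []):
    if item.get("interrupted"):
        print(f"interrupted {item['role']} message: {item['content']}")
```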
**User:** Is it possible to show the interrupted text messages in the correct order if I don't want to record the 'interrupted' status?
**AI:** Right now, it's a bit tricky to show interrupted text messages in the right order without keeping track of the 'interrupted' status. This status is actually a key part of how we manage the conversation and make sure the session history stays accurate. If you have any specific needs or want to explore other options, just let me know; I'm here to help figure out the best solution with you.

Sources: `livekit-agents/livekit/agents/voice/transcription/synchronizer.py`
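One hedged workaround, if you just need a sensible order on your side: post-process the exported history yourself. This is a sketch against the dict shape shown later in this thread, not an official API; it only reorders when every item actually carries a `created_at` timestamp:

```python
def ordered_items(history: dict) -> list[dict]:
    """Best-effort chronological ordering of exported history items."""
    items = history.get("items", [])
    # Only sort when every item has a created_at timestamp; otherwise
    # leave the exported order untouched rather than guess.
    if items and all("created_at" in item for item in items):
        return sorted(items, key=lambda item: item["created_at"])
    return items
```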
**User:** In the session history, will the user input ever get the interrupted status, or is it just the agent's output that gets `"interrupted": true`?
**AI:** Just a heads-up: in the session history, only the agent's responses can show an 'interrupted' status. Your input is always saved as part of the session but won't be marked as interrupted. If anything's unclear or you want to know more, don't hesitate to ask; I'm here to help!

Sources: `livekit-agents/livekit/agents/telemetry/trace_types.py` | Agent speech | LiveKit Docs
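You can verify that in your own sessions with a quick check over the exported history (same assumed `"items"` shape as the earlier sketch):

```python
from collections import Counter

history = session.history.to_dict()

# Count interrupted messages by role; per the answer above, only
# "assistant" should ever appear here.
roles = Counter(
    item["role"]
    for item in history.get("items", [])
    if item.get("interrupted")
)
print(roles)
```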
**User:** How do I differentiate user text and voice inputs in the session history?
**AI:** Hey there! To tell apart user text and voice inputs in the session history, you can take a look at the events logged during the session:

- **Text inputs:** These show up as text content right away and trigger events like `USER_TEXT_PARTIAL` or `InputTranscriptionCompleted`.
- **Voice inputs:** These go through speech-to-text (STT) processing and trigger events like `input_speech_started` and `input_speech_stopped`. The transcripts come in a bit later, asynchronously.

So, by checking the event types and how the content appears, you can easily spot whether the input was text or voice (see the sketch below). If you want, I'm happy to share more details or examples; just let me know!

Sources: `livekit-agents/livekit/agents/voice/agent_activity.py`
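As a sketch of that idea: rather than reconstructing the modality from the history afterwards, you could tag each user turn live as events arrive. The event names below (`user_state_changed`, `conversation_item_added`) are my assumptions about the `AgentSession` event emitter, not something confirmed in this thread; check your SDK version for the exact names and payloads:

```python
# App-maintained map: chat item id -> "voice" or "text".
modality_by_item_id: dict[str, str] = {}
user_was_speaking = False

@session.on("user_state_changed")
def _on_user_state(ev):
    # VAD-driven "speaking" states only ever fire for voice input.
    global user_was_speaking
    if ev.new_state == "speaking":
        user_was_speaking = True

@session.on("conversation_item_added")
def _on_item_added(ev):
    # Attribute the new user item to voice if speech preceded it.
    global user_was_speaking
    if ev.item.role == "user":
        modality_by_item_id[ev.item.id] = "voice" if user_was_speaking else "text"
        user_was_speaking = False
```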
**User:** The event type always shows 'message'. But I see that the voice inputs have an extra `transcript_confidence` field in the session history. Will this always be the case?
**AI:** Just a quick heads-up: the `transcript_confidence` field doesn't always show up in the session history. You'll usually see it for voice inputs that go through speech-to-text, but sometimes it might be missing. This can happen if:

- The voice input doesn't include confidence info.
- You're using manual turn detection mode, where this field isn't really needed.
- There are errors or some limits in the speech-to-text processing.

If you want me to explain anything more or help out with this, just let me know; I'm here for you!

Sources: `livekit-plugins/livekit-plugins-google/livekit/plugins/google/beta/realtime...`
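Because the field is optional, read it defensively when you walk the exported history (same assumed shape as before):

```python
history = session.history.to_dict()

for item in history.get("items", []):
    # .get() returns None when transcript_confidence is absent, so
    # text inputs and confidence-less voice inputs are both handled.
    confidence = item.get("transcript_confidence")
    if confidence is not None:
        print(f"{item['id']}: STT confidence {confidence:.2f}")
```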
**User:** How else do I differentiate between text and voice inputs from the session history?
**AI:** Hey there! If you want to tell apart text and voice inputs in your session history without using `transcript_confidence`, here are some handy ways to do it:

- **Event types:** Voice inputs come with speech events like `START_OF_SPEECH` and `END_OF_SPEECH`, which you won't see with text inputs.
- **Turn detection phases:** Voice inputs have phases like `USER_SPEAKING`, while text inputs skip these.
- **Audio buffer events:** Voice inputs include audio buffer events such as `input_audio_buffer.append`, but text inputs don't.
- **Transcription attributes:** You might spot attributes like `ATTR_TRANSCRIPTION_DELAY` with voice inputs, which aren't in text inputs.
- **Activity tracking:** Voice inputs trigger events like `input_speech_started` and `input_speech_stopped`.

If you'd like me to walk you through any of these or help set them up, just give me a shout! There's also a sketch below showing how to fold a tag like this back into the exported history.

Sources: `livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py` | `agents/src/multimodal/multimodal_agent.ts`
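Since none of those signals end up in the history dict itself, the practical move is to record them yourself and merge them in afterwards. A sketch building on the live-tagging idea from earlier (`modality_by_item_id` is the app-maintained map from that sketch, not an SDK field):

```python
def history_with_modality(
    history: dict, modality_by_item_id: dict[str, str]
) -> dict:
    """Annotate exported user items with an app-defined "modality" key."""
    for item in history.get("items", []):
        if item.get("role") == "user":
            item["modality"] = modality_by_item_id.get(item["id"], "unknown")
    return history
```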
**User:** `session.history.to_dict()` gave the below output:

```
{
  "id": "item_a0c923a5d094",
  "type": "message",
  "role": "assistant",
  "content": ["Welcome to"],
  "interrupted": True
},
{
  "id": "item_1afcbd2e218f",
  "type": "message",
  "role": "user",
  "content": ["hello."],
  "interrupted": False,
  "transcript_confidence": 1.0
}
```

How do I differentiate the voice and text inputs?
**AI:** Hey there! To tell the difference between voice and text inputs in that output, you can look for the `transcript_confidence` field:

- **Voice inputs:** These will have the `transcript_confidence` field, like the user message showing `"transcript_confidence": 1.0`.
- **Text inputs:** These won't include the `transcript_confidence` field.

So in this example, the user input with `"transcript_confidence": 1.0` means it was a voice input (see the sketch below). If you want me to explain more or have any questions, just let me know; I'm happy to help!
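As a one-liner helper, that heuristic looks like this. It's a sketch only: as discussed next, the field isn't guaranteed on every voice input, so prefer an explicit app-side tag when you can:

```python
def looks_like_voice(item: dict) -> bool:
    """Heuristic: treat the presence of transcript_confidence as a voice marker."""
    return "transcript_confidence" in item
```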
**User:** But you said not all voice inputs will have `transcript_confidence`.
**AI:** You're absolutely right that not every voice input will have the `transcript_confidence` field. When that happens, you can still tell voice and text inputs apart by looking at a few other clues:

- **Event types:** Voice inputs usually come with speech-related events like `START_OF_SPEECH` or `END_OF_SPEECH`, which you won't see with text inputs.
- **Turn detection phases:** Voice inputs go through phases like `USER_SPEAKING`, whereas text inputs skip these.
- **Audio buffer events:** Voice inputs include audio buffer events such as `input_audio_buffer.append`, which text inputs don't have.
- **Transcription attributes:** Sometimes voice inputs have attributes like `ATTR_TRANSCRIPTION_DELAY`, even if `transcript_confidence` is missing.

Putting these together should help you confidently tell voice and text inputs apart, even without `transcript_confidence`. If you want, I'm here to help you dig deeper!
**User:** Will these be available in the dictionary? Could you give a list of all the values that I could find in the session history dictionary?
**AI:** Here's a quick rundown of what you'll usually find in the session history dictionary for each message:

- `id`: A unique ID that identifies the message.
- `type`: The kind of message it is, like "message" or other event types.
- `role`: Who sent it, either the "user" or the "assistant".
- `content`: The actual text or content of the message.
- `interrupted`: A yes/no flag showing whether the message got cut off.
- `transcript_confidence`: (Optional) How confident the system is about voice input accuracy.
- `timestamps`: (Optional) When the message was created or last updated.
- `attributes`: (Optional) Extra details like transcription delays or audio buffer info.

If you want me to explain any of these in more detail or have questions about specific parts, just let me know! A quick way to see which keys actually appear in your own sessions is sketched below.

Sources: Session recording and transcripts | LiveKit Docs | Workflows | LiveKit Docs
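Since the optional fields vary per item, the most reliable "list of values" is the one your own sessions produce. A small sketch to collect every key that actually appears:

```python
history = session.history.to_dict()

# Union of keys across all items; optional fields like
# transcript_confidence only show up where they were recorded.
seen_keys: set[str] = set()
for item in history.get("items", []):
    seen_keys.update(item.keys())
print(sorted(seen_keys))
```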
**User:** How do I add timestamps to the history?
**AI:** Just so you know, timestamps are automatically added to the session history for each message. Every item in the history has a `created_at` property that shows exactly when it was added. If you'd like to save the session history along with those timestamps, you can use a shutdown callback to export the history. Here's a simple example:
```python
import json
from datetime import datetime

# `ctx` (the JobContext) and `session` (the AgentSession) are assumed
# to be in scope, e.g. inside your agent's entrypoint.

async def write_transcript():
    current_date = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"/tmp/transcript_{ctx.room.name}_{current_date}.json"
    with open(filename, "w") as f:
        json.dump(session.history.to_dict(), f, indent=2)

ctx.add_shutdown_callback(write_transcript)
```

This will save the session history, timestamps included, into a JSON file. If you have any questions or want me to walk you through it, just let me know; I'm here to help!
**User:** Not to the filename. I meant the timestamp you mentioned in the previous message, as a dictionary key.
**AI:** Hey there! If you want timestamps as dictionary keys in your session history, you can do this by exporting `session.history` and re-keying each entry by its timestamp. Here's a simple example to guide you:
```python
import json
from datetime import datetime

# Grab the session history (`session` is your active AgentSession)
history = session.history.to_dict()

# Depending on the SDK version, to_dict() may wrap the messages in an
# "items" key; fall back to the raw value if it doesn't.
items = history.get("items", history) if isinstance(history, dict) else history

# Re-key the history by timestamp
timestamped_history = {}

for item in items:
    # Use the item's own timestamp if it has one...
    if "created_at" in item:
        timestamp = item["created_at"]
    else:
        # ...otherwise generate one from the current time. isoformat()
        # includes microseconds, which keeps generated keys unique.
        timestamp = datetime.now().isoformat()

    timestamped_history[timestamp] = item

# Save the re-keyed history to a file
with open("session_history.json", "w") as f:
    json.dump(timestamped_history, f, indent=2)
```

This way, each item in your session history is keyed by its timestamp, and any item without one gets a generated timestamp on the spot. One caveat: dictionary keys must be unique, so two items with an identical `created_at` value would collide; keep the original list around if that matters for your use case. If you want me to walk you through anything else or have questions, just let me know!

Sources: Session recording and transcripts | LiveKit Docs | VADEvent | LiveKit Agents | STTMetrics | LiveKit Agents