# ask-ai
j
I am facing an issue with my voice agent. If the LLM generated 3 lines and they were passed to the TTS, but the user interrupted after listening to only 2 lines, the LLM doesn't know that only 2 lines were spoken, and the context gets mismatched. How do I solve this? Basically, how do I make the LLM know what part of the generated answer has actually been heard by the user?
t
Thanks for reaching out, Raghav. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
To address the issue of mismatched context when a user interrupts the TTS output, you can implement the following steps:

1. Detect the interruption: Use an interruption flag (e.g., `speech_handle.interrupted`) to identify when the user interrupts the TTS output. Cancel ongoing tasks and clear the audio buffer when an interruption occurs.
2. Retrieve the synchronized transcript: After detecting an interruption, retrieve the synchronized transcript of what was actually spoken. If this is unavailable, fall back to the last known text output.
3. Update the LLM context: Update the LLM's chat context with the text that was heard by the user, and mark the message as interrupted to maintain accurate context.

Here's an example of how this can be implemented:
```python
if speech_handle.interrupted:
    # Stop any in-flight tasks and flush audio that hasn't played yet.
    await utils.aio.cancel_and_wait(*tasks)
    if audio_output is not None:
        audio_output.clear_buffer()
        playback_ev = await audio_output.wait_for_playout()

if add_to_chat_ctx:
    # Default to the full generated text.
    forwarded_text = text_out.text if text_out else ""
    if speech_handle.interrupted and audio_output is not None:
        # Wait for playout so the playback event carries the transcript of
        # what was actually spoken before the interruption.
        playback_ev = await audio_output.wait_for_playout()
        if playback_ev.synchronized_transcript is not None:
            forwarded_text = playback_ev.synchronized_transcript

    if forwarded_text:
        # Store only the heard portion in the chat context and flag it as
        # interrupted so the next LLM turn sees accurate history.
        new_message = self._agent._chat_ctx.add_message(
            role="assistant",
            content=forwarded_text,
            interrupted=speech_handle.interrupted,
        )
        speech_handle._chat_items.append(new_message)
        self._session._conversation_item_added(new_message)
```
This ensures that the LLM context remains aligned with what the user actually heard, even in cases of interruptions.
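If you manage the chat history yourself rather than going through the framework internals above, here is a minimal sketch of the same idea in plain Python. The `SpokenTextTracker` class and its method names are hypothetical, not LiveKit APIs; the assumption is that your playback loop can tell you when each TTS segment has finished playing:

```python
from dataclasses import dataclass


@dataclass
class SpokenTextTracker:
    """Tracks how much of the LLM's reply has actually been played back."""

    full_text: str = ""    # everything the LLM generated
    spoken_text: str = ""  # the prefix that has finished playing

    def on_segment_played(self, segment: str) -> None:
        # Called by your playback loop each time a TTS audio segment
        # (and the text it was synthesized from) finishes playing.
        self.spoken_text += segment

    def final_assistant_text(self, interrupted: bool) -> str:
        # On interruption, only the spoken prefix goes back into the
        # chat context; otherwise the full reply does.
        return self.spoken_text if interrupted else self.full_text


# Usage sketch: `chat_history` stands in for your own list of chat messages.
tracker = SpokenTextTracker(full_text="Line one. Line two. Line three.")
tracker.on_segment_played("Line one. ")
tracker.on_segment_played("Line two. ")

chat_history = []
chat_history.append(
    {
        "role": "assistant",
        "content": tracker.final_assistant_text(interrupted=True),
        "interrupted": True,  # so the next LLM turn sees the truncated reply
    }
)
```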
j
I don't keep a transcript of what's being spoken by the TTS, as it's the same as the content generated by the LLM. But the problem is that the LLM generates the content in 2-3 seconds while the TTS takes much longer to speak it. Hence I am unable to tell the LLM up to what point the content was spoken.
t
Hey there! To help you sync LLM-generated content with TTS playback and keep track of what's been spoken, you can use a synchronization mechanism that lines up the text with the audio. Here's a step-by-step on how to do it:

1. Use a text-audio synchronizer: Implement a synchronizer (like `TextAudioSynchronizer`) that coordinates text and audio playback. It emits events like `textUpdated`, so you can easily see which part of the text has been spoken so far.
2. Track playback state: Keep an eye on playback completion using RPC signals or APIs such as `SpeechHandle`, and update the LLM context only after the audio has actually finished playing.
3. Implementation example:
```ts
// TextAudioSynchronizer and defaultTextSyncOptions appear in the LiveKit
// Agents JS codebase (see the cited agents/src/pipeline/pipeline_agent.ts).

// Initialize the synchronizer
const synchronizer = new TextAudioSynchronizer(defaultTextSyncOptions);

// Each 'textUpdated' event carries the portion of text that has been
// played back so far, keeping text and audio aligned.
synchronizer.on('textUpdated', (text) => {
  console.log('Spoken text:', text.text);
});
```
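If you are working in Python and want to approximate this alignment yourself, here is a minimal sketch under the assumption that you know (or can compute from sample counts) the audio duration of each synthesized text segment. `TextPlaybackAligner` and `SegmentTiming` are hypothetical names, not LiveKit APIs; the heard prefix it returns is what you would write back into the chat context, as in the earlier example:

```python
import bisect
from dataclasses import dataclass


@dataclass
class SegmentTiming:
    text: str
    duration_s: float  # duration of the audio synthesized for this segment


class TextPlaybackAligner:
    """Maps cumulative playback time to the prefix of text heard so far."""

    def __init__(self, segments: list[SegmentTiming]) -> None:
        self._segments = segments
        # Cumulative end time of each segment in seconds.
        self._ends: list[float] = []
        total = 0.0
        for seg in segments:
            total += seg.duration_s
            self._ends.append(total)

    def spoken_prefix(self, played_seconds: float) -> str:
        # Count the segments whose audio has fully played and join their text.
        n = bisect.bisect_right(self._ends, played_seconds)
        return "".join(seg.text for seg in self._segments[:n])


# Usage sketch: three sentences, interrupted after ~4.5 s of playback.
aligner = TextPlaybackAligner(
    [
        SegmentTiming("First sentence. ", 2.0),
        SegmentTiming("Second sentence. ", 2.2),
        SegmentTiming("Third sentence.", 2.1),
    ]
)
heard = aligner.spoken_prefix(played_seconds=4.5)
# heard == "First sentence. Second sentence. "
```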
This method helps you keep the LLM context in sync with what's really being spoken, even if the TTS takes a while to get through the content. Hope this makes things clearer and easier for you!

Sources: agents/src/pipeline/pipeline_agent.ts | Bringing AI avatars to voice agents | livekit-agents/livekit/agents/voice/agent_activity.py