# ask-ai
Based on the LiveKit code and the logged sequence of chunk and frame output, it seems that TTS waits for the full LLM stream to complete before it starts outputting audio. But obviously, if we want to generate a full agent response, the time to first token plus the time until the last token is generated will always be longer than a non-streaming call. So the logical question is: from a latency perspective, what's the point of using LLM streaming at all?
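One way to sanity-check this observation is to timestamp the stream yourself. The sketch below is a generic illustration under stated assumptions, not LiveKit's API: `llm_stream` is assumed to be any async iterator of text deltas, and the measurement simply contrasts time to first token with total generation time.

```python
import time
from typing import AsyncIterator, Optional


async def measure_stream_timing(llm_stream: AsyncIterator[str]) -> None:
    """Log time-to-first-token (TTFT) and total generation time for a token stream.

    `llm_stream` is a hypothetical async iterator of text deltas; substitute
    whatever your LLM client actually returns.
    """
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    chunks: list[str] = []

    async for delta in llm_stream:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
            print(f"TTFT: {now - start:.3f}s")
        chunks.append(delta)

    total = time.perf_counter() - start
    print(f"Total generation time: {total:.3f}s for {len(chunks)} chunks")
    # If TTS only starts after this loop finishes, audio latency is bounded
    # by the total generation time, not the TTFT -- which is exactly the
    # concern raised above.
```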
Thanks for reaching out, Владислав. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Thank you for your question! From a latency perspective, LLM streaming in LiveKit can still provide benefits, even if TTS waits for the full LLM stream to complete before outputting audio. Here are some key points:

- **Time to First Token (TTFT):** LLM streaming allows the first token to be generated earlier, which reduces the perceived latency for the user.
- **Latency optimization:** By optimizing network proximity and focusing on metrics like TTFT and TTFB (Time to First Byte for TTS), the overall latency can be minimized.
- **Instant mode:** Features like Hume's `instant_mode` provide ultra-low-latency streaming for TTS, reducing the time to the first chunk of audio, though it may come at a slightly higher cost.
- **Total latency calculation:** Total conversation latency is the sum of the end-of-utterance delay, TTFT, and TTFB. Streaming helps reduce the TTFT component, which is critical for faster response times (see the sketch after this answer).

While the total time for generating a full response may still be longer than a non-streaming call, starting to process and generate tokens earlier can improve the overall user experience by reducing the perceived delay.

Sources: How can I reduce latency in voice agents using STT, TTS and LLM? | Hume TTS integration guide | LiveKit Docs | Capturing metrics | LiveKit Docs
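Worth noting: the latency win from streaming only materializes if the downstream TTS consumes the LLM output incrementally, for example sentence by sentence, rather than waiting for the full completion. The sketch below illustrates that pipelining pattern under assumptions: `llm_stream` and `synthesize` are hypothetical placeholders, not LiveKit APIs.

```python
import re
from typing import AsyncIterator, Awaitable, Callable


async def pipeline_llm_to_tts(
    llm_stream: AsyncIterator[str],
    synthesize: Callable[[str], Awaitable[None]],
) -> None:
    """Forward LLM text to TTS sentence by sentence instead of waiting for the
    full response. Both `llm_stream` and `synthesize` are stand-ins for
    whatever your LLM and TTS clients actually provide.
    """
    buffer = ""
    async for delta in llm_stream:
        buffer += delta
        # Flush each complete sentence to TTS as soon as it is available,
        # so audio playback can begin long before the last token arrives.
        while (match := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[: match.end()], buffer[match.end():]
            await synthesize(sentence.strip())

    if buffer.strip():
        await synthesize(buffer.strip())  # flush any trailing partial sentence


# With this kind of pipelining, perceived latency is roughly
#   total_latency = end_of_utterance_delay + llm_ttft + tts_ttfb
# rather than end_of_utterance_delay + full LLM generation time + tts_ttfb.
```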