I'm glad you asked about how the STT, LLM, and TTS components in the AgentSession object work together! They're designed to stream their outputs to each other in real time, which keeps the interaction feeling smooth and responsive. Here's a breakdown of the process:
• STT to LLM: Instead of waiting for the full transcription, the LLM starts processing the speech-to-text output as it comes in. This way, it can begin inference on partial transcriptions, which cuts down on response latency.
• LLM to TTS: Similarly, the TTS system doesn't wait for the entire LLM response. It starts synthesizing speech as soon as it gets partial text from the LLM, so audio generation proceeds incrementally (there's a small toy sketch of this right after the list).
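To make that concrete, here's a tiny toy example (plain asyncio, not LiveKit's actual code) of the LLM-to-TTS half of the pattern: the downstream stage starts working on each fragment as soon as it arrives instead of waiting for the complete response. The STT-to-LLM hand-off follows the same principle.

```python
import asyncio
from collections.abc import AsyncIterator


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: yields response tokens one at a time."""
    for token in ["It", " looks", " sunny", " today", "."]:
        await asyncio.sleep(0.2)  # pretend per-token generation latency
        yield token


async def fake_tts(text_stream: AsyncIterator[str]) -> None:
    """Stand-in for a streaming TTS: handles each fragment on arrival,
    so audio synthesis can begin long before the LLM has finished."""
    async for fragment in text_stream:
        print(f"TTS: synthesizing {fragment!r}")


asyncio.run(fake_tts(fake_llm("What's the weather like?")))
```

A real pipeline adds more on top of this (buffering, interruption handling, and so on), but the latency win comes from exactly this kind of overlap.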
The nice thing about this streaming setup is that it overlaps the work done by STT, LLM, and TTS, so each downstream stage starts before the one upstream has finished and the whole interaction feels much more real-time and efficient. I hope this helps clarify things! If you have any more questions, feel free to ask; I'm here to help.
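In case it's useful, here's a minimal sketch of how those streaming components are handed to an AgentSession, loosely based on the LiveKit Agents Python quickstart. The specific plugins (Deepgram, OpenAI, Cartesia) and the model name are just illustrative choices here, not requirements:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai


async def entrypoint(ctx: agents.JobContext) -> None:
    await ctx.connect()

    # Each component streams: the STT emits interim transcripts, the LLM
    # emits tokens, and the TTS synthesizes audio from partial text.
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(),
    )

    await session.start(
        agent=Agent(instructions="You are a friendly voice assistant."),
        room=ctx.room,
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

A real agent would usually also configure VAD/turn detection and provide API keys for each provider, but the stt/llm/tts trio above is the part that forms the streaming pipeline described here.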
Sources:
Pipeline nodes and hooks | LiveKit Docs
livekit-agents/livekit/agents/voice/agent_session.py
agents/src/voice/agent_activity.ts