# ask-ai
**User:** Is it still sensible to use the livekit-agents framework if I want all of the speech, or 90% of it, to be generated by another backend API? When a large part of the "text" generation is governed by another service, how do you use LiveKit for voice on top of that? Any best practices/tips?
**Assistant:** Thanks for reaching out, Tim. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Using LiveKit's Agents framework alongside another backend API for text generation is well supported. LiveKit integrates with multiple third-party text-to-speech (TTS) backends, so you can synthesize speech while another service produces the text. Some tips to get started:

• Use LiveKit's TTS plugins for speech synthesis. They offer a consistent interface across backends such as Resemble AI, Google Gemini, and more (a minimal sketch follows below).
• Make sure your text generation service outputs text in a format that matches LiveKit's TTS input requirements.
• For real-time applications, prefer streaming TTS backends supported by LiveKit, such as Resemble AI or Uplift AI, to keep latency low.
• Take advantage of LiveKit's error handling and session management features to keep the integration reliable.

If you have a specific backend in mind or would like guidance setting up a particular integration, let me know and I can walk you through it.

Sources: livekit-plugins/livekit-plugins-speechify/README.md | Gemini TTS integration guide | LiveKit Docs | ElevenLabs integration guide | LiveKit Docs
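To picture the simplest case, where your backend hands LiveKit finished reply text and LiveKit only has to voice it, here is a minimal sketch. It assumes an `AgentSession` (LiveKit's session orchestrator, discussed later in this thread) that was already configured with one of the TTS plugins; the dialog-engine URL and response shape are hypothetical placeholders for your own service:

```python
import aiohttp
from livekit.agents import AgentSession

async def speak_external_reply(session: AgentSession, user_text: str) -> None:
    # Ask your own backend for the reply text (hypothetical endpoint and payload).
    async with aiohttp.ClientSession() as http:
        async with http.post(
            "https://dialog-engine.example.com/reply", json={"text": user_text}
        ) as resp:
            reply = (await resp.json())["reply"]

    # Hand the finished text to the voice layer; the TTS plugin configured on the
    # session turns it into audio in the room. The returned handle can be used to
    # wait for or interrupt playback.
    session.say(reply)
```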
**User:** I guess my point is that most of the conversational logic will live elsewhere. We have a conversational dialog engine. We just want to build a voice layer on top without putting too much of the agentic logic in the voice layer, unless it's needed to guarantee good UX. Does that make sense?
**Assistant:** Using LiveKit as a voice layer while keeping your conversational logic in an external dialog engine is a well-supported approach. Some tips to make the setup effective:

• Integrate your dialog engine: Replace LiveKit's default LLM with a custom adapter that routes requests to your dialog engine. LiveKit then manages the voice input/output while your engine handles the conversation logic.
• Use turn detection: LiveKit's turn detection models help keep the conversation natural by recognizing when the user has finished speaking (see the sketch after this list).
• Optimize response timing: Tuning endpointing delays balances responsiveness against the risk of cutting the user off.
• Dynamic voice updates: Participant attributes can be used to adjust TTS settings on the fly, such as changing the voice style mid-conversation.
• Noise cancellation: Enabling noise cancellation improves STT accuracy, especially in noisy real-world environments.
• Tool calls: LiveKit's tool calling features are useful for fetching real-time data or triggering backend workflows from your dialog engine.

If you'd like, I can walk you through specific integration steps or discuss the dialog engine you're using.

Sources: agent/main.py | agent-examples/agent_examples/pipeline_agent.py | agent.py | Tool definition and use | LiveKit Docs | 24/7 Front-Office Support: How Assort Health Uses LiveKit to Scale Patient ...
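To make the turn-detection and endpointing points concrete, here is a minimal configuration sketch. It assumes a recent livekit-agents (1.x) install with the Deepgram, Cartesia, Silero, and turn-detector plugins; the stock OpenAI LLM is a stand-in for the custom dialog-engine adapter, and the exact endpointing parameter names can vary between versions:

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, cartesia, openai, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    stt=deepgram.STT(),                  # speech-to-text
    llm=openai.LLM(model="gpt-4o"),      # stand-in: swap for your dialog-engine adapter
    tts=cartesia.TTS(),                  # text-to-speech
    vad=silero.VAD.load(),               # voice activity detection
    turn_detection=MultilingualModel(),  # semantic end-of-turn detection
    min_endpointing_delay=0.5,           # wait this long once a turn end looks likely
    max_endpointing_delay=6.0,           # upper bound when the turn end is uncertain
)
```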
**User:** Do you have any examples of such a setup?
**Assistant:** Here are some examples of LiveKit being used as a voice layer in front of an external conversational engine:

• LangGraph integration: Use LiveKit's STT and TTS while LangGraph handles the dialog logic. The agent session connects to a LangGraph state machine through the `langchain.LLMAdapter`; for instance, a GPT-4o node in LangGraph manages the conversation flow while LiveKit takes care of voice input and output (a sketch follows this list).
• Zapier MCP integration: The agent forwards transcribed user requests to Zapier to trigger automations. The MCP server URL is pulled from an environment variable, and LiveKit manages the voice processing separately from your business logic.
• OpenAI Realtime API: You can use LiveKit's turn detection model instead of the API's built-in detection, which helps keep everything in sync during interruptions. The `MultimodalAgent` class handles both streaming text and audio, aligning transcriptions with audio playback.

These examples show how LiveKit's voice capabilities can be separated cleanly from dialog management, making it straightforward to integrate with other systems. Let me know if you'd like detailed implementation steps for any of these.

Sources: OpenAI and LiveKit partner to turn Advanced Voice into an API | examples/voice_agents/email_example.py | examples/src/drive-thru/drivethru_agent.ts
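As a rough sketch of the LangGraph pattern above: the dialog logic is built and compiled entirely in LangGraph, then wrapped for the voice layer. This assumes the `livekit-plugins-langchain` package providing `LLMAdapter` (as referenced above) plus the usual STT/TTS/VAD plugins; constructor details may differ between versions:

```python
from langgraph.graph import StateGraph, MessagesState, START
from langchain_openai import ChatOpenAI
from livekit.agents import AgentSession
from livekit.plugins import langchain, deepgram, cartesia, silero

def build_graph():
    # All conversational logic lives in LangGraph; the voice layer never sees it.
    model = ChatOpenAI(model="gpt-4o")

    def chatbot(state: MessagesState):
        return {"messages": [model.invoke(state["messages"])]}

    workflow = StateGraph(MessagesState)
    workflow.add_node("chatbot", chatbot)
    workflow.add_edge(START, "chatbot")
    return workflow.compile()

session = AgentSession(
    stt=deepgram.STT(),
    llm=langchain.LLMAdapter(build_graph()),  # LangGraph drives the conversation
    tts=cartesia.TTS(),
    vad=silero.VAD.load(),
)
```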
**User:** Any such examples of integration with Pydantic AI instead of LangGraph?
**Assistant:** There aren't specific resources showing a LiveKit plus Pydantic AI integration, but LiveKit is designed to work with various AI providers through plugins, and the same general integration patterns apply to Pydantic AI. For instance, you might use `AgentSession` (or the older `VoicePipelineAgent`) to connect components like VAD (voice activity detection), STT (speech-to-text), and TTS (text-to-speech), then route the conversational logic to Pydantic AI through a custom adapter, much like the OpenAI or LangGraph integrations are set up. A rough sketch of the Pydantic AI side is below.

If you'd like, I can help you create a custom adapter for Pydantic AI or walk through the integration step by step.

Sources: agent.py | agent/main.py | agent-examples/agent_examples/pipeline_agent.py | agent.py | pipeline-tts/playai_tts.py
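For the Pydantic AI side of such an adapter, a rough sketch might look like the following. It assumes pydantic-ai's `Agent.run_stream` streaming API (method names can shift between releases); the model string, system prompt, and function name are just illustrative:

```python
from pydantic_ai import Agent

dialog_agent = Agent(
    "openai:gpt-4o",
    system_prompt="You are the dialog engine behind a voice assistant.",
)

async def stream_reply(user_text: str):
    # Stream text deltas so the voice layer can start TTS before the reply finishes.
    async with dialog_agent.run_stream(user_text) as result:
        async for delta in result.stream_text(delta=True):
            yield delta  # a LiveKit LLM adapter would wrap each delta in a chat chunk
```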
**User:** Yes, a custom adapter would be cool, or directions to the base abstraction I should use for that.
**Assistant:** To create a custom adapter that integrates Pydantic AI (or any other conversational AI) with LiveKit, you extend the base abstractions provided by the LiveKit Agents framework:

• Choose the component to customize: extend the `llm.LLM` class for conversational logic, `stt.STT` for speech-to-text, or `tts.TTS` for text-to-speech.
• Implement the required methods: for an LLM, implement the `chat()` method to handle conversational responses, and use an `LLMStream` to emit tokenized responses in real time. For STT or TTS, implement the `stream()` or `synthesize()` methods.

Here's a simple example of a custom LLM adapter to give you a clearer picture:
from livekit.agents import llm, ChatContext

class CustomLLM(llm.LLM):
    """Minimal LLM adapter skeleton: LiveKit calls chat(); we return a stream."""

    def __init__(self, model_name: str):
        super().__init__()
        self.model_name = model_name

    def chat(self, *, chat_ctx: ChatContext, **kwargs) -> "CustomLLMStream":
        # chat() should return quickly; the real work happens in the stream's _run().
        return CustomLLMStream(llm=self, chat_ctx=chat_ctx)

class CustomLLMStream(llm.LLMStream):
    async def _run(self):
        # Placeholder: emit a canned, pre-tokenized reply. In a real adapter this is
        # where you would call your external dialog engine and forward its output.
        # Note: the exact LLMStream constructor arguments and ChatChunk fields
        # (e.g. a required chunk id) vary between livekit-agents versions, so check
        # them against the version you have installed.
        responses = ["Hello", " how", " can", " I", " help", " you", "?"]
        for token in responses:
            chunk = llm.ChatChunk(
                delta=llm.ChoiceDelta(role="assistant", content=token)
            )
            self._event_ch.send_nowait(chunk)
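To show where your dialog engine would actually plug in, here is a hypothetical variant of `_run` (building on the imports and classes above) that forwards the latest user turn to an external HTTP service and relays its streamed reply. The endpoint, payload, and newline-delimited response format are placeholders, and the way the last user message is read from the chat context varies between livekit-agents versions:

```python
import aiohttp

class ExternalEngineLLMStream(llm.LLMStream):
    async def _run(self):
        # Last user turn from the chat context (attribute names vary by version).
        user_text = self.chat_ctx.items[-1].text_content

        async with aiohttp.ClientSession() as http:
            async with http.post(
                "https://dialog-engine.example.com/reply/stream",  # hypothetical endpoint
                json={"text": user_text},
            ) as resp:
                # Assume the service streams newline-delimited text deltas.
                async for line in resp.content:
                    token = line.decode().rstrip("\n")
                    if not token:
                        continue
                    self._event_ch.send_nowait(
                        llm.ChatChunk(
                            delta=llm.ChoiceDelta(role="assistant", content=token)
                        )
                    )
```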
Once you have your adapter ready, you can plug it into the `AgentSession` like this:
from livekit.agents import AgentSession
from livekit.plugins import silero

session = AgentSession(
    stt=custom_stt,  # your custom STT adapter (or any LiveKit STT plugin)
    llm=CustomLLM(model_name="pydantic-ai"),
    tts=custom_tts,  # your custom TTS adapter (or any LiveKit TTS plugin)
    vad=silero.VAD.load(),  # Silero voice activity detection plugin
)
If you'd like a complete working template for a specific component like STT or TTS, or more detail on integrating Pydantic AI, just let me know.

Sources: examples/voice_agents/langgraph_agent.py | Live Conversations with AI using ChatGPT and WebRTC | OpenAI Realtime API and LiveKit | LiveKit Docs
**User:** So the AgentSession, it's not a REST API. What do you call it when you deploy it?
**Assistant:** The `AgentSession` in LiveKit isn't a REST API; it's a real-time orchestrator that manages AI agent interactions. Typically it runs inside a deployed agent service (often called an agent worker) that connects to a LiveKit server as a participant in a room. Here's how it works:

• Deployment model: The `AgentSession` lives inside a deployed agent service, often containerized with Docker or managed with Kubernetes. It handles the entire interaction lifecycle, including initializing components like STT, LLM, TTS, and VAD (a minimal worker sketch follows below).
• Connection: The agent joins a LiveKit room as a participant and handles real-time audio and text processing throughout the session.
• Usage: It can be deployed as part of a bigger application or as a standalone service, with usage tracked by session minutes and concurrent sessions.

If you'd like, I can walk you through a containerized deployment for `AgentSession` or provide more detail on how it integrates with LiveKit.

Sources: LiveKit Pricing | Baseten STT integration guide | LiveKit Docs | Baseten TTS integration guide | LiveKit Docs | Agent Deployment & Docker Configuration | livekit/livekit_composite | DeepW... | Infrastructure & Deployment | livekit/livekit_composite | DeepWiki
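For concreteness, a minimal sketch of the agent service you would containerize might look like this (assuming livekit-agents 1.x naming, with stock Deepgram/Cartesia/Silero plugins as stand-ins; `CustomLLM` is the dialog-engine adapter sketched earlier):

```python
from livekit.agents import Agent, AgentSession, JobContext
from livekit.plugins import deepgram, cartesia, silero

async def entrypoint(ctx: JobContext):
    # Called once per job: the worker joins the room as a participant and the
    # AgentSession orchestrates STT -> dialog engine -> TTS for that room.
    await ctx.connect()

    session = AgentSession(
        stt=deepgram.STT(),
        llm=CustomLLM(model_name="pydantic-ai"),  # the adapter sketched earlier
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are the voice layer for an external dialog engine."),
    )
```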
**User:** So is it WebSocket or something that it uses? A Python server with WebSocket?
**Assistant:** It's not a REST server you expose. The agent runs as a Python worker process that opens a WebSocket connection to the LiveKit server to register itself and receive job assignments (via the dispatch API); when a job arrives, it joins the room as a participant and exchanges audio over WebRTC. A few points on how that looks in practice (a sketch of the worker launch follows below):

• Deployment model: The `AgentSession` lives inside that deployed worker, typically containerized with Docker or managed through Kubernetes, and it drives the interaction lifecycle, starting up the STT, LLM, TTS, and VAD components.
• Connection: The agent joins a LiveKit room as a participant and manages real-time voice interactions. Jobs are assigned via the dispatch API, and the `AgentSession` handles each interaction instance.
• Usage: This setup can be part of a bigger application or run as its own service for specific tasks like transcription or synthesis, and you can host it on LiveKit Cloud if you want to scale easily.

If you'd like, I can help you set up a containerized deployment or integrate it into your existing infrastructure.
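As a closing sketch of how that worker is launched and how it connects: the process registers with the LiveKit server over a secure WebSocket (the `wss://` address in `LIVEKIT_URL`), receives job assignments over that connection, and then exchanges audio with room participants over WebRTC. Assuming the `entrypoint` function from the sketch above:

```python
from livekit.agents import WorkerOptions, cli

if __name__ == "__main__":
    # Reads LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET from the environment,
    # opens a WebSocket to the LiveKit server, and waits for jobs to be dispatched.
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
    # `python agent.py dev` for local testing; `python agent.py start` in production.
```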