Hi Team, I’ve been working with LiveKit Vision and...
# ask-ai
p
Hi Team, I’ve been working with LiveKit Vision and LiveKit Virtual Avatars (Tavus). While testing with the Gemini Live model, we noticed that the lip sync is not accurate. Currently, both STT and TTS are handled by the Gemini Live model, but our requirement is slightly different: we would like GPT to handle STT and TTS, while using the Gemini Live model solely as the LLM to process video frames and provide responses. Is this configuration possible?

Code:

```python
from dotenv import load_dotenv
import logging
import os

from livekit.agents import (
    AgentSession,
    RoomInputOptions,
    AutoSubscribe,
    WorkerOptions,
    cli,
)
from livekit.agents import Agent
from livekit.agents import metrics
from livekit.plugins import openai, noise_cancellation, silero
from livekit.plugins import tavus
from livekit.plugins import google

load_dotenv()

logger = logging.getLogger("vision-agent")
logger.setLevel(logging.INFO)

# Azure OpenAI credentials for STT and TTS (two separate deployments)
azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
azure_deployment = os.getenv("DEPLOYMENT_NAME")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
api_version = os.getenv("OPENAI_API_VERSION")
azure_endpoint_2 = os.getenv("AZURE_OPENAI_ENDPOINT_2")
azure_api_key_2 = os.getenv("AZURE_OPENAI_API_KEY_2")


class VisionAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a helpful voice assistant with live video input from your user. "
                "If the user speaks in English, respond in English. "
                "If the user speaks in Tamil or Hindi, respond in the same language."
            )
        )


async def entrypoint(ctx):
    await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)

    # STT and TTS run on Azure OpenAI; Gemini Live is used as the LLM
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=openai.STT.with_azure(
            model="gpt-4o-transcribe",
            azure_endpoint=azure_endpoint,
            api_key=api_key,
            api_version=api_version,
            detect_language=True,
        ),
        llm=google.beta.realtime.RealtimeModel(
            voice="Kore",
            temperature=0.8,
        ),
        tts=openai.TTS.with_azure(
            model="gpt-4o-mini-tts",
            voice="alloy",
            azure_endpoint=azure_endpoint_2,
            api_key=azure_api_key_2,
        ),
    )

    usage_collector = metrics.UsageCollector()
    session.on("metrics_collected", lambda ev: usage_collector.collect(ev.metrics))

    assistant = VisionAgent()

    # Tavus avatar renders and lip-syncs the agent's audio output
    avatar = tavus.AvatarSession(
        replica_id="rc2146c13e81",
        persona_id="pdced222244b",
    )
    await avatar.start(session, room=ctx.room)
    print("Avatar session started successfully")

    await session.start(
        agent=assistant,
        room=ctx.room,
        room_input_options=RoomInputOptions(
            noise_cancellation=noise_cancellation.BVC(),
            video_enabled=True,
            close_on_disconnect=False,
        ),
    )

    async def log_usage():
        summary = usage_collector.get_summary()
        print("================================= ", summary)
        logger.info(f"Usage summary: {summary}")

    ctx.add_shutdown_callback(log_usage)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
t
Thanks for reaching out, Mamlesh. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Just to make sure I’m on the right track—are you looking for a step-by-step guide to set up LiveKit Vision and Tavus together, or are you wondering if this setup is even possible with what LiveKit can do right now?
p
Let's do both
t
Hey there! Just a heads-up: it’s not really possible to set up LiveKit Vision and Tavus so that GPT handles STT and TTS while the LLM part is routed to Gemini Live. That’s because Gemini Live is a tightly integrated speech-to-speech model, and it doesn’t let you swap out its built-in speech handling independently of its LLM.

But don’t worry! There are some other ways you can set things up that might work well for you:

• Use Gemini Live for LLM + STT, but skip its TTS: You can turn off audio output in Gemini Live and send the text responses to a different TTS engine like OpenAI, Cartesia, or Google Cloud TTS (see the sketch below).
• Use Gemini Live with custom turn detection and external STT: You could disable Gemini Live’s automatic activity detection and instead use your own STT engine along with LiveKit’s VAD and turn-detection features.
• Use Gemini LLM (not Live) with GPT-enhanced STT/TTS: Another option is to skip Gemini Live altogether and use the Gemini LLM for text reasoning, OpenAI’s Whisper for STT, and OpenAI TTS or Cartesia for voice synthesis.
• Use Gemini TTS with a non-Gemini LLM: You might also try using GPT or another LLM for reasoning, combined with Gemini TTS to get expressive speech output.

If any of these sound interesting, I’d be happy to walk you through the detailed steps or help design a setup that fits exactly what you need. Just let me know!

Sources: Large language model (LLM) integrations | LiveKit Docs | Gemini Live API integration guide | LiveKit Docs | AI Voice Assistant Quickstart | LiveKit Docs | Gemini TTS integration guide | LiveKit Docs
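For reference, here's a rough sketch of the first option applied to your code: Gemini Live keeps handling the audio/video input and reasoning, but is asked to reply with text only, and your Azure OpenAI TTS voices the replies that Tavus lip-syncs against. This assumes the Google realtime plugin's `RealtimeModel` accepts a `modalities=["TEXT"]` argument as described in the Gemini Live integration guide, so please verify it against the plugin version you have installed:

```python
import os
from dotenv import load_dotenv
from livekit.agents import (
    Agent,
    AgentSession,
    AutoSubscribe,
    RoomInputOptions,
    WorkerOptions,
    cli,
)
from livekit.plugins import google, openai, silero, noise_cancellation, tavus

load_dotenv()


async def entrypoint(ctx):
    await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)

    session = AgentSession(
        vad=silero.VAD.load(),
        # No separate STT: Gemini Live consumes the audio (and video) directly.
        # `modalities=["TEXT"]` is an assumption taken from the integration guide;
        # it asks Gemini Live to return text instead of generating its own audio.
        llm=google.beta.realtime.RealtimeModel(
            modalities=["TEXT"],
            temperature=0.8,
        ),
        # The text replies are then voiced by your existing Azure OpenAI TTS deployment.
        tts=openai.TTS.with_azure(
            model="gpt-4o-mini-tts",
            voice="alloy",
            azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT_2"),
            api_key=os.getenv("AZURE_OPENAI_API_KEY_2"),
        ),
    )

    # Tavus lip-syncs against the TTS audio produced above.
    avatar = tavus.AvatarSession(
        replica_id="rc2146c13e81",
        persona_id="pdced222244b",
    )
    await avatar.start(session, room=ctx.room)

    await session.start(
        agent=Agent(
            instructions="You are a helpful voice assistant with live video input from your user."
        ),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            noise_cancellation=noise_cancellation.BVC(),
            video_enabled=True,
        ),
    )


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

If you'd rather go with the third option and drop Gemini Live entirely, the idea would be to swap the `llm=` entry for the plugin's standard Gemini LLM class (for example `google.LLM(...)`) and keep your existing Azure STT and TTS blocks as they are; happy to sketch that variant too if it's useful.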