# ask-ai
**User:**
Is there any way to get only text output from the Gemini Live API?
**Assistant:**
Thanks for reaching out, James. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Yes, you can get only text output from the Gemini Live API by setting the `modalities` parameter to `["TEXT"]` when you create the `RealtimeModel` instance. For example:
```python
llm=google.beta.realtime.RealtimeModel(
    model="gemini-2.0-flash-exp",
    modalities=["TEXT"],
)
```
This configures the API to produce only text output. Additionally, if you want your agent to operate in a fully text-only mode (no audio input or output), you can configure the session like this:
```python
session = AgentSession(
    llm=google.beta.realtime.RealtimeModel(
        model="gemini-2.0-flash-exp",
        modalities=["TEXT"],
    ),
    room_input_options=RoomInputOptions(text_enabled=True, audio_enabled=False),
    room_output_options=RoomOutputOptions(transcription_enabled=True, audio_enabled=False),
)
```
This ensures all input and output are handled as text only and disables audio streams entirely. Let me know if you need more details or examples for a specific language or client.

Sources:
- Gemini Multimodal Live integration guide | LiveKit Docs
- agent/main.py
- Text and transcriptions | LiveKit Docs
**User:**
And can I then turn this text output into audio by using another TTS service?
**Assistant:**
Yes, you can feed the text output from the Gemini Live API into any third-party text-to-speech (TTS) service to generate audio. This is useful when you want more control over voice, language, or style than the built-in options provide. The typical setup looks like this:

1. Configure the Gemini Live API to output text by setting `modalities=["TEXT"]`.
2. Extract the text responses from the API.
3. Send the text to your chosen TTS service (for example Google Cloud TTS, ElevenLabs, Deepgram, AWS Polly, or Azure TTS).
4. Receive and play back or stream the audio output as needed.

The LiveKit Agents framework offers direct integrations for many TTS providers and supports features such as voice selection, pronunciation control, and streaming audio for smooth, natural-sounding interactions. When choosing a TTS service, consider latency, voice quality, streaming capability, and any pronunciation or voice customizations you need; many providers can be configured through LiveKit's unified TTS interface. Let me know if you need help setting up a specific TTS provider or want example code for a particular framework.

Sources:
- AI Agents Framework | livekit/livekit_composite | DeepWiki
- ElevenLabs integration guide | LiveKit Docs
- livekit-plugins/livekit-plugins-spitch/livekit/plugins/spitch/tts.py
- Live Conversations with AI using ChatGPT and WebRTC
- Speech & Audio Processing | livekit/livekit_composite | DeepWiki
- Gemini Multimodal Live integration guide | LiveKit Docs
- Agent speech | LiveKit Docs