Is it possible to improve Realtime Model VAD in Gemini?
# ask-ai
f
Is it possible to improve Realtime Model VAD in gemini?
t
Thanks for reaching out, Baymurat. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
To improve the Realtime Model VAD in Gemini, you can configure it to use LiveKit's turn detection model for more accurate results. Here's how:
1. Disable Server-Side VAD: Set `automatic_activity_detection` to `disabled` in the `RealtimeInputConfig` to turn off Gemini's default VAD.
2. Configure LiveKit's Turn Detection: Use a compatible turn detection model, such as `MultilingualModel`, by setting it in the `turn_detection` parameter of `AgentSession`.
3. Provide STT for Transcription: Since Gemini doesn't provide interim transcripts, use a separate STT plugin like Deepgram to feed transcription data to the turn detection model.
Here's an example configuration:
```python
from google.genai import types
from livekit.agents import AgentSession
from livekit.plugins import deepgram, google
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    # LiveKit's turn detection model decides end of turn
    turn_detection=MultilingualModel(),
    llm=google.beta.realtime.RealtimeModel(
        realtime_input_config=types.RealtimeInputConfig(
            # Turn off Gemini's server-side VAD
            automatic_activity_detection=types.AutomaticActivityDetection(
                disabled=True,
            ),
        ),
        input_audio_transcription=None,
    ),
    # Deepgram provides the transcripts the turn detector needs
    stt=deepgram.STT(),
)
```
This setup disables Gemini's VAD and enables LiveKit's turn detection model to handle end-of-turn detection using the provided STT transcription.
f
I wonder how many languages this supports? MultilingualModel? I will be using it for the Thai language.
t
Hey there! The `MultilingualModel` in LiveKit is pretty awesome: it supports English plus 13 other languages, including Chinese, Dutch, French, German, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish. That said, it doesn't support Thai just yet. If you're looking to work with Thai, you might want to check out specific STT plugins or other models that are designed for Thai transcription (there's a sketch of one fallback just below). I'm here if you want help exploring those options! Sources: LiveKit turn detector plugin | LiveKit Docs
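If Thai is a hard requirement, here's a rough fallback sketch (my suggestion, not something from the docs above): skip the turn-detector model entirely and let VAD silence detection end the turn. The string `"vad"` mode and the Silero plugin wiring are assumptions based on current livekit-agents versions, so double-check against your installed version:
```python
from livekit.agents import AgentSession
from livekit.plugins import silero

# Rough sketch: for a language the turn-detector model doesn't cover (like Thai),
# fall back to VAD silence detection for end-of-turn instead of MultilingualModel.
session = AgentSession(
    turn_detection="vad",  # assumption: string mode supported by your livekit-agents version
    vad=silero.VAD.load(),
    # llm / stt would be configured the same way as in the Gemini example above
)
```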
f
what STT do you recommend?
t
Hey there! Thai language ("th") is indeed listed as a supported language for STT in LiveKit. That said, the exact STT plugins that specifically support Thai transcription aren’t clearly detailed. I’d suggest checking out different STT providers that work with LiveKit to see which ones handle Thai best. If you need any help with that, just let me know! Sources: livekit-plugins/livekit-plugins-spitch/livekit/plugins/spitch/stt.py
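As one hedged sketch (again my suggestion, not something the sources above spell out): OpenAI's Whisper-based STT is known to transcribe Thai, so the OpenAI STT plugin with a language hint could be a starting point. The `language` keyword and the overall wiring are assumptions to verify against the plugin version you install:
```python
from livekit.agents import AgentSession
from livekit.plugins import openai, silero

# Sketch only: Whisper models are known to handle Thai, so the OpenAI STT plugin
# with a language hint is one option to evaluate. The "language" keyword is an
# assumption; check the installed plugin's signature.
session = AgentSession(
    stt=openai.STT(language="th"),
    vad=silero.VAD.load(),
    turn_detection="vad",  # Thai isn't covered by MultilingualModel (see above)
    # llm would be the Gemini RealtimeModel from earlier in this thread
)
```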
f
what available models do we have for gemini realtime
t
Here are the models available for Gemini Realtime in LiveKit:
1. `gemini-2.0-flash-exp`
2. `gemini-2.0-flash-live-001`
3. `gemini-live-2.5-flash-preview`
4. `gemini-2.5-flash-preview-native-audio-dialog`
5. `gemini-2.5-flash-exp-native-audio-thinking-dialog`
Just a heads-up: the 2.5 models are still in preview, so they might have a few quirks, like some function calling issues. If you want to dive deeper into any specific model or how to use them, feel free to ask—I’m here to help! Sources: Gemini Live API integration guide | LiveKit Docs
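For reference, here's a minimal sketch of how one of these model IDs gets selected; `model`, `voice`, and `temperature` are the same parameters used elsewhere in this thread, and the voice name "Puck" is just an illustrative value:
```python
from livekit.plugins import google

# Sketch: choose one of the model IDs listed above via the `model` argument.
# The voice name "Puck" is illustrative; check which voices your chosen model supports.
llm = google.beta.realtime.RealtimeModel(
    model="gemini-2.0-flash-live-001",
    voice="Puck",
    temperature=0.8,
)
```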
f
how is this one `gemini-2.5-flash-exp-native-audio-thinking-dialog` better than `gemini-2.5-flash-exp-native-audio-dialog`?
t
Both `gemini-2.5-flash-exp-native-audio-thinking-dialog` and `gemini-2.5-flash-exp-native-audio-dialog` are available models, but there aren't specific details about how they differ. If you have a particular use case in mind, I'd suggest giving both a try to see which one works best for you (see the sketch below). I'm here to help if you need any guidance along the way! Sources: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/models.py
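If you do want to A/B them, a tiny sketch (the environment variable name `GEMINI_REALTIME_MODEL` is made up for this example) is to switch the model ID from config and keep everything else identical:
```python
import os

from livekit.plugins import google

# Sketch for trying both variants: switch the model ID via an environment variable
# (GEMINI_REALTIME_MODEL is a made-up name) and keep the rest of the session
# identical, then compare latency and answer quality between runs.
model_id = os.environ.get(
    "GEMINI_REALTIME_MODEL",
    "gemini-2.5-flash-exp-native-audio-dialog",
)
llm = google.beta.realtime.RealtimeModel(model=model_id)
```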
f
Can I disable interruption in gemini realtime?
t
Absolutely, you can disable interruption in Gemini Realtime by using the `allow_interruptions` parameter. Here's a quick rundown on how to do it:
1. In JavaScript/TypeScript: Just set `allowInterruptions` to `false` in `VPAOptions` when you're setting up the voice pipeline agent.
2. In Python: When you initialize `AgentSession`, configure it with `allow_interruptions=False` (see the sketch just after this answer).
3. Manual Control: If you ever need to stop the agent's current response on the fly, you can use `session.interrupt()`.
Just a heads-up: if you're working with a RealtimeModel that has built-in turn detection, the `allow_interruptions` parameter might not have any effect. It's a good idea to double-check your model's features to make sure everything works smoothly for your needs. Sources: VPAOptions | LiveKit Agents | livekit-agents/livekit/agents/voice/agent_session.py | Configuring turn detection and interruptions | LiveKit Docs | Turn detection and interruptions | LiveKit Docs
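Here's a minimal Python sketch of steps 2 and 3 together, assuming a pipeline-style session with Silero VAD; the `stop_current_reply` helper is made up for illustration:
```python
from livekit.agents import AgentSession
from livekit.plugins import silero

# Sketch: with allow_interruptions=False the agent's replies aren't cut off
# when the user starts speaking.
session = AgentSession(
    allow_interruptions=False,
    vad=silero.VAD.load(),
    # stt / llm / tts configured as usual for a pipeline agent
)


async def stop_current_reply() -> None:
    # Manual control (step 3): stop whatever the agent is currently saying.
    session.interrupt()
```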
f
```python
return AgentSession(
    llm=google.beta.realtime.RealtimeModel(
        model=lang_config.llm.model or "gemini-2.5-flash-preview-native-audio-dialog",
        voice=voice_name,
        temperature=0.8,
        instructions=full_instructions,
        modalities=["AUDIO"],
        language=stt_language,
        enable_affective_dialog=True,
        proactivity=True,
        # Disable server-side turn detection
    ),
    allow_interruptions=False,
    # No vad, stt, or tts needed for Gemini realtime
)
```
But here I am having this error:
```
2025-08-06 11:40:16 [ERROR] livekit.agents: Error in _update_activity_task
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/site-packages/livekit/agents/utils/log.py", line 16, in async_fn_logs
    return await fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.12/site-packages/livekit/agents/voice/agent_session.py", line 683, in _update_activity_task
    self._next_activity = AgentActivity(task, self)
  File "/opt/anaconda3/lib/python3.12/site-packages/livekit/agents/voice/agent_activity.py", line 103, in __init__
    raise ValueError(
ValueError: the RealtimeModel uses a server-side turn detection, allow_interruptions cannot be False, disable turn_detection in the RealtimeModel and use VAD on the AgentSession instead

2025-08-06 11:40:16 [DEBUG] livekit.agents: start reading stream {"participant": "user-anonymous", "source": "SOURCE_MICROPHONE"}
2025-08-06 11:40:16 [ERROR] __mp_main__: Agent initialization failed: the RealtimeModel uses a server-side turn detection, allow_interruptions cannot be False, disable turn_detection in the RealtimeModel and use VAD on the AgentSession instead

2025-08-06 11:40:16 [ERROR] livekit.agents: unhandled exception while running the job task
Traceback (most recent call last):
  File "/Users/baymuratabdumuratov/Documents/aicoach-backend/livekit-service/livekit-agent.py", line 892, in entrypoint
    await agent.run(ctx)
  File "/Users/baymuratabdumuratov/Documents/aicoach-backend/livekit-service/livekit-agent.py", line 578, in run
    await self._start_session(session, ctx, agent)
  File "/Users/baymuratabdumuratov/Documents/aicoach-backend/livekit-service/livekit-agent.py", line 844, in _start_session
    await session.start(
  File "/opt/anaconda3/lib/python3.12/site-packages/livekit/agents/voice/agent_session.py", line 424, in start
    await asyncio.gather(*tasks)
  File "/opt/anaconda3/lib/python3.12/site-packages/livekit/agents/utils/log.py", line 16, in async_fn_logs
    return await fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.12/site-packages/livekit/agents/voice/agent_session.py", line 683, in _update_activity_task
    self._next_activity = AgentActivity(task, self)
  File "/opt/anaconda3/lib/python3.12/site-packages/livekit/agents/voice/agent_activity.py", line 103, in __init__
    raise ValueError(
ValueError: the RealtimeModel uses a server-side turn detection, allow_interruptions cannot be False, disable turn_detection in the RealtimeModel and use VAD on the AgentSession instead
```
t
Hey there! The error you're seeing happens because the RealtimeModel uses server-side turn detection, which doesn't play well with setting `allow_interruptions=False`. No worries though, we can fix this by turning off server-side turn detection and setting up VAD for your AgentSession. Here's a simple way to do it:
1. Turn Off Turn Detection in the RealtimeModel: Disable Gemini's automatic activity detection in your RealtimeModel configuration, like this:
```python
from google.genai import types
from livekit.plugins import google

llm = google.beta.realtime.RealtimeModel(
    realtime_input_config=types.RealtimeInputConfig(
        # Disable Gemini's server-side VAD / turn detection
        automatic_activity_detection=types.AutomaticActivityDetection(
            disabled=True,
        ),
    ),
    input_audio_transcription=None,
)
```
2. Switch On VAD in the AgentSession: Use a VAD plugin such as Silero and set `allow_interruptions=False`. Here's an example:
```python
from livekit.agents import AgentSession
from livekit.plugins import silero

session = AgentSession(
    allow_interruptions=False,
    llm=llm,  # the RealtimeModel above, with server-side VAD disabled
    vad=silero.VAD.load(),
)
```
3. Optional - Add a Turn Detection Model: If you want more advanced turn detection, you can use LiveKit's turn detection model like this:
```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    allow_interruptions=False,
    turn_detection=MultilingualModel(),
    vad=silero.VAD.load(),
    stt=deepgram.STT(),  # transcripts feed the turn detector
    llm=llm,
)
```
This setup moves turn detection to the client side using VAD, which clears up the conflict and lets you disable interruptions smoothly. If you have any questions or want to chat more about this, just let me know! Sources: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/realtime/real... | Turn detection and interruptions | LiveKit Docs
f
I tried this, but it did not work out:
```python
llm=google.beta.realtime.RealtimeModel(
    model=lang_config.llm.model or "gemini-2.5-flash-preview-native-audio-dialog",
    voice=voice_name,
    temperature=0.8,
    instructions=full_instructions,
    modalities=["AUDIO"],
    language=stt_language,
    enable_affective_dialog=True,
    proactivity=True,
    # Disable server-side turn detection
    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection=types.AutomaticActivityDetection(
            disabled=True,
        ),
    ),
),
vad=silero.VAD.load(),
allow_interruptions=False,
# No vad, stt, or tts needed for Gemini realtime
)
```
t
Hey there! It looks like the issue is happening because the Gemini RealtimeModel still has some rules it follows, even when server-side turn detection is turned off. Specifically, setting `allow_interruptions=False` doesn't quite fit with what the model needs. No worries though, here's a simple way to fix it:
1. Set `allow_interruptions=True`: If you're using Gemini Live with client-side turn detection, you'll want to allow interruptions so the client-side VAD and STT can handle things smoothly.
2. Make sure VAD and STT are set up right: Use client-side VAD and STT to manage interruptions and turn detection. For example:
```python
from google.genai import types
from livekit.agents import AgentSession
from livekit.plugins import deepgram, google, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    turn_detection=MultilingualModel(),
    vad=silero.VAD.load(),
    llm=google.beta.realtime.RealtimeModel(
        realtime_input_config=types.RealtimeInputConfig(
            automatic_activity_detection=types.AutomaticActivityDetection(
                disabled=True,
            ),
        ),
        input_audio_transcription=None,
    ),
    stt=deepgram.STT(),
)
```
This setup turns off server-side turn detection and relies on client-side VAD and STT to handle it instead. If anything’s unclear or you want a hand with this, just let me know—I’m here to help! Sources: Gemini Live API integration guide | LiveKit Docs | livekit-agents/livekit/agents/voice/agent.py | Turn detection and interruptions | LiveKit Docs | Turn detection | LiveKit Docs