# ask-ai
m
I'm using the `openai.realtime.RealtimeModel` with `text` modalities to cover both STT and LLM, while TTS is handled by a different model. I realized there's an unnatural delay in the RealtimeModel (it's taking 5-6 seconds per inference). How do I optimize the latency of the model without changing my instructions?
t
Thanks for reaching out, Anas. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
m
you didn't respond
t
Sorry if things were a bit unclear earlier! I've put together some tips to help speed up the latency when using the `openai.realtime.RealtimeModel` with text. Here's what you can try:
• Lower the `temperature` and cut down on `max_response_output_tokens` to get quicker replies.
• Tweak the `turn_detection` settings to help with response timing.
• Stick to just the text modality if you don't need audio.
• Make sure your network is as close as possible to the API endpoint for better performance.
• Keep an eye on latency using `RealtimeModelMetrics`.
If you want me to go into more detail on any of these or need a hand with something else, just let me know—I'm here to help!
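For reference, here's a rough sketch of how those knobs could fit together in the Python plugin (treat `modalities` and `max_response_output_tokens` as assumptions to verify against your plugin version):
```python
from livekit.plugins import openai

# latency-oriented setup: text-only output, capped response length
# (modalities / max_response_output_tokens are assumed plugin parameters)
model = openai.realtime.RealtimeModel(
    modalities=["text"],             # skip audio generation; a separate TTS handles speech
    temperature=0.6,                 # the Realtime API's minimum; less sampling variance
    max_response_output_tokens=512,  # shorter replies finish sooner
)
```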
m
• What are the recommended `TurnDetection` settings and `temperature` to ensure low-latency and good-enough accuracy?
• How do I keep an eye on `RealtimeModelMetrics`?
t
Here are some friendly recommendations and steps to help with your queries:
• Recommended `TurnDetection` settings:
```python
from openai.types.beta.realtime.session import TurnDetection

turn_detection = TurnDetection(
    type="server_vad",
    threshold=0.5,
    prefix_padding_ms=300,
    silence_duration_ms=200,
    create_response=True,
    interrupt_response=True,
)
```
These settings strike a nice balance between low latency and solid accuracy. If you want to speed things up a bit more, you could try lowering `silence_duration_ms` to 100-150 ms—but just a heads up, that might cause a few more interruptions.
• Recommended `temperature`: I'd suggest setting `temperature` to `0.8`—it keeps responses lively yet focused.
• Monitoring `RealtimeModelMetrics`: here's a quick example for logging metrics:
```python
from livekit.agents import metrics, MetricsCollectedEvent

usage_collector = metrics.UsageCollector()

@session.on("metrics_collected")
def on_metrics_collected(ev: MetricsCollectedEvent):
    # log every metrics event (realtime model, TTS, end-of-utterance, ...)
    metrics.log_metrics(ev.metrics)
    # accumulate token/usage stats for a summary later
    usage_collector.collect(ev.metrics)
```
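If you want to watch the realtime model specifically, you can filter on the metrics type. A minimal sketch, assuming `RealtimeModelMetrics` (with `ttft`, `duration`, and `tokens_per_second` fields) is exported from `livekit.agents.metrics` in your version:
```python
from livekit.agents import metrics, MetricsCollectedEvent

@session.on("metrics_collected")
def on_realtime_metrics(ev: MetricsCollectedEvent):
    m = ev.metrics
    # only report inference metrics emitted by the realtime model
    if isinstance(m, metrics.RealtimeModelMetrics):
        print(f"ttft={m.ttft:.3f}s duration={m.duration:.3f}s tok/s={m.tokens_per_second:.1f}")
```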
a. Use the `metrics_collected` event to keep track of important things like Time To First Token (TTFT), response duration, and tokens per second.
b. Log those metrics with the `log_metrics` function so you can easily see how everything's performing.
c. Pay special attention to key metrics like `ttft` and `duration`—they're great for spotting any slowdowns.
If you have any questions or want me to walk you through anything else, just let me know—I'm here to help!
Sources: AI Agents Framework | livekit/livekit_composite | DeepWiki | Agent Testing & Evaluation | livekit/livekit_composite | DeepWiki | Logs, metrics, and telemetry | LiveKit Docs
m
What if I don't wanna enable turn detection because I want to use `on_user_turn_completed`?
t
If you'd rather not enable turn detection and want to stick with `on_user_turn_completed`, no worries! Here are some friendly tips to help you optimize latency:
• Manual Audio Control: You can use RPC methods to manually start and stop audio recording, so you have full control over when the model processes the input.
• Preemptive Generation: Try starting the response generation as soon as the final transcript is ready, even before the user's turn is officially done. This can help cut down on waiting time by overlapping tasks.
• Custom Turn Detection: Feel free to create your own way to detect when the user's turn ends (like using VAD) and trigger the response generation yourself (see the manual-mode sketch after the example below).
• Audio Buffer Management: Managing audio buffers directly lets you send only the important audio data to the model, which helps speed things up.
Here's a simple example of how preemptive generation might look:
```python
from livekit.agents import Agent
from livekit.agents.llm import ChatContext, ChatMessage

class PreemptiveAgent(Agent):
    async def on_user_turn_completed(
        self, turn_ctx: ChatContext, new_message: ChatMessage
    ) -> None:
        # kick off response generation as soon as the final transcript arrives
        self.session.generate_reply(
            instructions="Start thinking about the response..."
        )
```
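For the custom turn detection route, recent livekit-agents versions also expose a manual turn-control mode. Here's a minimal sketch, assuming `turn_detection="manual"` and the `commit_user_turn()`/`clear_user_turn()` methods exist on `AgentSession` in your version:
```python
from livekit.agents import AgentSession
from livekit.plugins import silero

session = AgentSession(
    turn_detection="manual",  # no automatic end-of-turn decisions
    vad=silero.VAD.load(),    # still useful for interruption handling
    # ... llm / tts config ...
)

# call these from your own logic (an RPC handler, your own VAD callback, etc.):
# session.commit_user_turn()  # end the user's turn and trigger a reply
# session.clear_user_turn()   # discard the buffered user input
```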
If you want me to walk you through any of these or share more examples, just let me know—I’m here to help! Sources: livekit-agents/livekit/agents/voice/agent_activity.py | Voice activity detection (VAD) parameters | LiveKit Docs
m
Yes please walk me through how to implement that. My understanding is that VAD is slower than TurnDetection. So I don’t fully understand how this would lower my latency
t
Preemptive response generation is mainly built for STT-LLM-TTS pipelines and doesn't quite fit directly with `RealtimeModel`. But no worries—here's a friendly guide on how you can speed things up without depending on Turn Detection:
• Disable Turn Detection: You can set up the model to turn off server-side turn detection like this:
```python
from livekit.plugins import openai

llm = openai.realtime.RealtimeModel(
    model="gpt-4o-realtime-preview",
    temperature=0.8,
    voice="alloy",
    turn_detection=None,  # disable server-side turn detection
)
```
• Manual Response Generation: Try using the `generate_reply()` method to kick off responses whenever you want:
```python
# manually trigger response generation
await session.generate_reply(instructions="Respond to the user's input")
```
• Optimize VAD Parameters: Tweaking server-side VAD settings—like lowering silence duration and threshold values—can help cut down latency.
• Monitor Latency: Keep an eye on response times with metrics and logging to spot any slowdowns. For example:
```python
import logging

logger = logging.getLogger("agent")

@session.on("agent_state_changed")
def _on_agent_state_changed(ev):
    # last_eou_metrics: latest EOUMetrics saved in a "metrics_collected" handler
    if ev.new_state == "speaking" and last_eou_metrics is not None:
        logger.info(f"End-to-end latency: {ev.created_at - last_eou_metrics.last_speaking_time}")
```
If you want me to walk you through any of these steps or have any questions, just let me know—I’m here to help! Sources: Agent speech and audio | LiveKit Docs | livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/realtime/real...
m
Walk me through the VAD optimization. Also, why and where do I need to use the `generate_reply` function if the RealtimeModel already covers both STT and LLM?
t
To help you get the best low-latency experience with VAD settings in `RealtimeModel`, here are a few tweaks you can try:
• `silence_duration_ms`: Try lowering this to around 200 ms to catch the end of speech quicker. Just a heads-up—setting it too low might cut people off mid-sentence.
• `prefix_padding_ms`: Bringing this down to about 100 ms can reduce the delay before the speech starts getting processed.
• `threshold`: Dropping this to around 0.3 helps with faster detection, but it might lead to more false alarms if there's a lot of background noise.
Here's a quick example to show how it all fits together:
```python
from livekit.plugins.openai import realtime
from openai.types.beta.realtime.session import TurnDetection

model = realtime.RealtimeModel(
    turn_detection=TurnDetection(
        type="server_vad",
        threshold=0.3,
        prefix_padding_ms=100,
        silence_duration_ms=200,
        create_response=True,
        interrupt_response=True,
    )
)
```
About the `generate_reply` function: even though `RealtimeModel` covers both STT and LLM, it only responds on its own when turn detection decides the user's turn is over. With turn detection disabled, nothing triggers a response automatically, so you call `generate_reply()` yourself whenever your own logic decides the user is done. It's also handy for steering a response with custom instructions. For instance:
```python
await session.generate_reply(
    instructions="Greet the user and offer your assistance."
)
```
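And to connect it back to the earlier `turn_detection=None` setup, here's a sketch of where the call lives (`on_end_of_turn_detected` is a hypothetical stand-in for your own end-of-turn logic):
```python
# with turn_detection=None the model never auto-responds, so some code path
# of yours must decide the turn is over and ask for a reply.
# on_end_of_turn_detected() is a hypothetical hook for your own VAD/RPC logic.
async def on_end_of_turn_detected():
    await session.generate_reply(
        instructions="Answer the user's last message concisely."
    )
```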
This gives you a lot of flexibility to keep the conversation flowing just the way you want. If you want me to walk you through it or help set things up, just let me know—I’m here to help! Sources: Agent speech and audio | LiveKit Docs | livekit-agents/README.md