# ask-ai
n
is it possible to run an agent pipeline with: `gpt-4o-audio-preview` (audio input) -> text out -> tts plugin out (ie. elevenlabs)?
g
Yes, it is possible to run an agent pipeline with `gpt-4o-audio-preview` for audio input, converting it to text, and then using a TTS plugin like ElevenLabs for audio output. You can achieve this by using the VoicePipelineAgent in LiveKit, which allows you to configure different models for STT, LLM, and TTS. For instance, you can use OpenAI's model for the LLM and ElevenLabs for TTS. To set this up, you would configure the VoicePipelineAgent with the appropriate plugins for each stage of the pipeline. According to the LiveKit documentation, you can customize the models used in the pipeline by specifying the desired STT, LLM, and TTS plugins when creating the agent. Would you like more detailed guidance on setting this up?
Sources:
- LiveKit Plugins
- LiveKit documentation
n
@millions-hair-24956 can you show me an example of this with the openai plugin (with audio input)?
g
To set up a VoicePipelineAgent using the OpenAI plugin for audio input and ElevenLabs for TTS, you can follow this example. This setup involves configuring the agent with the appropriate plugins for each stage of the pipeline: STT, LLM, and TTS.
```python
from livekit.agents import VoicePipelineAgent
from livekit.plugins import openai, elevenlabs, silero

# Initialize the VoicePipelineAgent
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),  # Voice Activity Detection
    stt=openai.STT(model="gpt-4o-audio-preview"),  # OpenAI STT model
    llm=openai.LLM(),  # OpenAI LLM
    tts=elevenlabs.TTS(model_id="eleven_turbo_v2"),  # ElevenLabs TTS
    allow_interruptions=True,
    interrupt_speech_duration=0.5,
    interrupt_min_words=0,
    min_endpointing_delay=0.5
)

# Start the agent in a room with a participant
agent.start(room, participant)
```
This example initializes a `VoicePipelineAgent` with OpenAI's STT model for audio input, OpenAI's LLM for processing, and ElevenLabs' TTS for audio output. The `silero.VAD` is used for voice activity detection to manage the conversation flow. Would you like more detailed information on any specific part of this setup?
Sources:
- VoicePipelineAgent Documentation
n
@millions-hair-24956 hmm but in this case it would collapse STT + LLM into a single step
g
It seems likely that using OpenAI's `gpt-4o-audio-preview` model would indeed combine the STT and LLM steps into a single process, as this model is designed to handle audio input directly and process it into text or other outputs. This approach simplifies the pipeline by reducing the number of distinct processing steps, but it may limit the flexibility to independently configure the STT and LLM components. If you want to maintain separate STT and LLM steps, you might consider using a different STT model that outputs text, which can then be processed by an LLM (see the sketch below). This would allow you to have more control over each stage of the pipeline. Would you like more detailed guidance on setting up a pipeline with separate STT and LLM components?
Sources:
- https://docs.livekit.io/agents/plugins/#VAD-and-StreamAdapter
- https://docs.livekit.io/agents/voice-agent/
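For illustration, a pipeline with separate STT and LLM stages could look roughly like the sketch below. This is a minimal sketch, assuming the Whisper-based `openai.STT` plugin, a text-only `openai.LLM` model, and the same import paths as the example above; check your installed plugin versions for the exact constructor parameters.
```python
from livekit.agents import VoicePipelineAgent
from livekit.plugins import openai, elevenlabs, silero

# Separate stages: Whisper transcribes the audio to text,
# a text-only LLM generates the reply, and ElevenLabs speaks it.
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),           # voice activity detection
    stt=openai.STT(),                # Whisper-based transcription (audio -> text)
    llm=openai.LLM(model="gpt-4o"),  # text-in / text-out LLM
    tts=elevenlabs.TTS(),            # audio output
)

# `room` and `participant` come from your job/entrypoint context,
# as in the earlier example.
agent.start(room, participant)
```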
n
@millions-hair-24956 no i'd like to know if the agent library supports an audio input LLM
g
Yes, the LiveKit agent library supports audio input LLMs through the use of the MultimodalAgent class. This class is designed to handle both audio and text inputs, allowing it to process audio directly and generate audio responses. The MultimodalAgent uses OpenAI's Realtime API, which is capable of processing audio inputs and managing conversation flow in real time. According to the LiveKit documentation, the MultimodalAgent can accept audio inputs and is responsible for managing the conversation state, buffering responses, and handling interruptions. This makes it suitable for applications where audio input is a primary mode of interaction. Would you like more detailed information on how to implement this with the MultimodalAgent?
Sources:
- MultimodalAgent Documentation
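If it helps, a minimal sketch of that setup might look like the following, assuming the `openai.realtime.RealtimeModel` class from the LiveKit OpenAI plugin and the `MultimodalAgent` import path shown; constructor parameters vary between plugin versions, so treat this as illustrative rather than exact.
```python
from livekit.agents.multimodal import MultimodalAgent
from livekit.plugins import openai

# The realtime model consumes audio directly and produces audio (and text)
# responses, so there is no separate STT stage in this setup.
model = openai.realtime.RealtimeModel(
    instructions="You are a helpful voice assistant.",
    voice="alloy",
    modalities=["text", "audio"],
)

agent = MultimodalAgent(model=model)
agent.start(room, participant)  # room/participant come from your entrypoint context
```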
n
@millions-hair-24956 does the `MultimodalAgent` support `gpt-4o-audio-preview`?
g
The `MultimodalAgent` in LiveKit is designed to work with OpenAI's Realtime API, which includes support for models like `gpt-4o-realtime-preview`. However, specific support for `gpt-4o-audio-preview` is not explicitly mentioned in the provided data sources. The `MultimodalAgent` is generally used for handling both audio and text inputs, leveraging OpenAI's capabilities for real-time processing. If you are looking to use `gpt-4o-audio-preview` specifically, it would be best to verify its compatibility with the `MultimodalAgent` by checking the latest updates in the LiveKit documentation or reaching out to LiveKit support for confirmation. Would you like more detailed guidance on using the `MultimodalAgent` with available models?
Sources:
- https://docs.livekit.io/python/livekit/plugins/openai/realtime/index.html#raises
- https://docs.livekit.io/python/livekit/plugins/openai/realtime/realtime_model.html#raises
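For what it's worth, the Realtime model name can typically be pinned when constructing `openai.realtime.RealtimeModel` (the `model` parameter below is an assumption based on common plugin versions; verify it against your installed plugin). Note that `gpt-4o-audio-preview` targets OpenAI's Chat Completions API rather than the Realtime API, which is likely why it isn't listed for the `MultimodalAgent`.
```python
from livekit.plugins import openai

# Pin the Realtime API model explicitly (parameter name assumed; check your
# plugin version). gpt-4o-audio-preview is a Chat Completions audio model,
# so it would not be passed here.
model = openai.realtime.RealtimeModel(model="gpt-4o-realtime-preview")
```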
g
Hey @narrow-pager-84057 - were you able to figure this out? I'm trying to do exactly what you were as well.