# ask-ai
n
is it possible to run an agent pipeline with: `gpt-4o-audio-preview` (audio input) -> text out -> tts plugin out (ie. elevenlabs)?
g
Yes, it is possible to run an agent pipeline with `gpt-4o-audio-preview` for audio input, converting it to text, and then using a TTS plugin like ElevenLabs for audio output. You can achieve this by using the VoicePipelineAgent in LiveKit, which allows you to configure different models for STT, LLM, and TTS. For instance, you can use OpenAI's model for the LLM and ElevenLabs for TTS. To set this up, you would configure the VoicePipelineAgent with the appropriate plugins for each stage of the pipeline. According to the LiveKit documentation, you can customize the models used in the pipeline by specifying the desired STT, LLM, and TTS plugins when creating the agent. Would you like more detailed guidance on setting this up?
Sources:
- LiveKit Plugins
- LiveKit documentation
n
@millions-hair-24956 can you show me an example of this with the openai plugin (with audio input)?
g
To set up a VoicePipelineAgent using the OpenAI plugin for audio input and ElevenLabs for TTS, you can follow this example. This setup involves configuring the agent with the appropriate plugins for each stage of the pipeline: STT, LLM, and TTS.
```python
from livekit.agents import VoicePipelineAgent
from livekit.plugins import openai, elevenlabs, silero

# Initialize the VoicePipelineAgent
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),  # Voice Activity Detection
    stt=openai.STT(model="gpt-4o-audio-preview"),  # OpenAI STT model
    llm=openai.LLM(),  # OpenAI LLM
    tts=elevenlabs.TTS(model_id="eleven_turbo_v2"),  # ElevenLabs TTS
    allow_interruptions=True,
    interrupt_speech_duration=0.5,
    interrupt_min_words=0,
    min_endpointing_delay=0.5
)

# Start the agent in a room with a participant
agent.start(room, participant)
```
This example initializes a `VoicePipelineAgent` with OpenAI's STT model for audio input, OpenAI's LLM for processing, and ElevenLabs' TTS for audio output. The `silero.VAD` is used for voice activity detection to manage the conversation flow. Would you like more detailed information on any specific part of this setup?
Sources:
- VoicePipelineAgent Documentation
n
@millions-hair-24956 hmm but in this case it would collapse STT + LLM into a single step
g
It seems likely that using OpenAI's `gpt-4o-audio-preview` model would indeed combine the STT and LLM steps into a single process, as this model is designed to handle audio input directly and process it into text or other outputs. This approach simplifies the pipeline by reducing the number of distinct processing steps, but it may limit the flexibility to independently configure the STT and LLM components. If you want to maintain separate STT and LLM steps, you might consider using a different STT model that outputs text, which can then be processed by an LLM (see the sketch below). This would allow you to have more control over each stage of the pipeline. Would you like more detailed guidance on setting up a pipeline with separate STT and LLM components?
Sources:
- https://docs.livekit.io/agents/plugins/#VAD-and-StreamAdapter
- https://docs.livekit.io/agents/voice-agent/
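For illustration, a pipeline with separate STT and LLM stages could look roughly like the sketch below. This is a minimal sketch, assuming the Whisper-based `openai.STT` plugin, a text-only `openai.LLM` model, and the same import paths as the example above; check your installed plugin versions for the exact constructor parameters.
```python
from livekit.agents import VoicePipelineAgent
from livekit.plugins import openai, elevenlabs, silero

# Separate stages: Whisper transcribes the audio to text,
# a text-only LLM generates the reply, and ElevenLabs speaks it.
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),           # voice activity detection
    stt=openai.STT(),                # Whisper-based transcription (audio -> text)
    llm=openai.LLM(model="gpt-4o"),  # text-in / text-out LLM
    tts=elevenlabs.TTS(),            # audio output
)

# `room` and `participant` come from your job/entrypoint context,
# as in the earlier example.
agent.start(room, participant)
```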
n
@millions-hair-24956 no i'd like to know if the agent library supports an audio input LLM
g
Yes, the LiveKit agent library supports audio input LLMs through the use of the MultimodalAgent class. This class is designed to handle both audio and text inputs, allowing it to process audio directly and generate audio responses. The MultimodalAgent uses OpenAI's Realtime API, which is capable of processing audio inputs and managing conversation flow in real time. According to the LiveKit documentation, the MultimodalAgent can accept audio inputs and is responsible for managing the conversation state, buffering responses, and handling interruptions. This makes it suitable for applications where audio input is a primary mode of interaction. Would you like more detailed information on how to implement this with the MultimodalAgent?
Sources:
- MultimodalAgent Documentation
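If it helps, a minimal sketch of that setup might look like the following, assuming the `openai.realtime.RealtimeModel` class from the LiveKit OpenAI plugin and the `MultimodalAgent` import path shown; constructor parameters vary between plugin versions, so treat this as illustrative rather than exact.
```python
from livekit.agents.multimodal import MultimodalAgent
from livekit.plugins import openai

# The realtime model consumes audio directly and produces audio (and text)
# responses, so there is no separate STT stage in this setup.
model = openai.realtime.RealtimeModel(
    instructions="You are a helpful voice assistant.",
    voice="alloy",
    modalities=["text", "audio"],
)

agent = MultimodalAgent(model=model)
agent.start(room, participant)  # room/participant come from your entrypoint context
```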
n
@millions-hair-24956 does the `MultimodalAgent` support `gpt-4o-audio-preview`?
g
The `MultimodalAgent` in LiveKit is designed to work with OpenAI's Realtime API, which includes support for models like `gpt-4o-realtime-preview`. However, specific support for `gpt-4o-audio-preview` is not explicitly mentioned in the provided data sources. The `MultimodalAgent` is generally used for handling both audio and text inputs, leveraging OpenAI's capabilities for real-time processing. If you are looking to use `gpt-4o-audio-preview` specifically, it would be best to verify its compatibility with the `MultimodalAgent` by checking the latest updates in the LiveKit documentation or reaching out to LiveKit support for confirmation. Would you like more detailed guidance on using the `MultimodalAgent` with available models?
Sources:
- https://docs.livekit.io/python/livekit/plugins/openai/realtime/index.html#raises
- https://docs.livekit.io/python/livekit/plugins/openai/realtime/realtime_model.html#raises
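For what it's worth, the Realtime model name can typically be pinned when constructing `openai.realtime.RealtimeModel` (the `model` parameter below is an assumption based on common plugin versions; verify it against your installed plugin). Note that `gpt-4o-audio-preview` targets OpenAI's Chat Completions API rather than the Realtime API, which is likely why it isn't listed for the `MultimodalAgent`.
```python
from livekit.plugins import openai

# Pin the Realtime API model explicitly (parameter name assumed; check your
# plugin version). gpt-4o-audio-preview is a Chat Completions audio model,
# so it would not be passed here.
model = openai.realtime.RealtimeModel(model="gpt-4o-realtime-preview")
```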
g
Hey @narrow-pager-84057 - were you able to figure this out? I'm trying to do exactly what you were as well.