<@U084APK40NS> the transcriptions in the log are wrong ...
# ask-ai
m
@millions-hair-24956 the transcriptions in the log are wrong and different from my speech. I am using the MultimodalAgent. Can I use any other models for transcription?
g
It seems likely that the transcription discrepancies you're experiencing with the MultimodalAgent are due to the specific model being used. According to the LiveKit documentation, the MultimodalAgent uses OpenAI's multimodal model for processing audio and generating responses. If you need more accurate transcriptions, you might consider using a different STT provider or model that is optimized for your language or use case. For instance, the VoicePipelineAgent gives you explicit control over the STT, LLM, and TTS components, which lets you switch to an STT provider that better suits your needs. Additionally, the LiveKit plugin framework is designed to be extensible, allowing integration with various providers or models for local inference. Would you like more detailed guidance on how to implement these changes?
Sources:
- LiveKit Documentation - Plugins Overview
- LiveKit Documentation - VoicePipelineAgent
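For reference, a minimal VoicePipelineAgent sketch with a swapped-in STT provider might look like the following. This is a sketch, not a definitive implementation: the model names, voice, and greeting text are illustrative, and it assumes the Deepgram, OpenAI, and Silero plugins are installed and configured with API keys.

```python
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    participant = await ctx.wait_for_participant()

    # Each stage is pluggable: swap deepgram.STT() for another provider if needed.
    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-2-general"),   # illustrative model choice
        llm=openai.LLM(model="gpt-4o-mini"),         # illustrative model choice
        tts=openai.TTS(voice="alloy"),               # illustrative voice choice
    )
    agent.start(ctx.room, participant)
    await agent.say("Hi there, how can I help?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```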
m
@millions-hair-24956 without using the VoicePipelineAgent, is any other method available?
g
If you prefer not to use the VoicePipelineAgent, you can still improve transcription accuracy by leveraging the different STT plugins available in the LiveKit framework. According to the LiveKit documentation, you can use plugins such as Deepgram or Google for speech-to-text tasks. These plugins can be integrated into your existing setup without switching to the VoicePipelineAgent. You can also use the Deepgram STT plugin or the OpenAI STT plugin directly in your application; these plugins provide methods to handle audio streams and return transcriptions, which may be more suitable for your specific requirements. Would you like more detailed instructions on how to integrate these plugins?
Sources:
- LiveKit Documentation - Plugins Overview
- Deepgram STT plugin
- OpenAI STT plugin
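As an illustration of how interchangeable these are, the STT plugins implement a common interface, so switching providers is mostly a matter of constructing a different object. The model names below are examples under the assumption that the corresponding plugins and credentials are set up:

```python
from livekit.plugins import deepgram, google, openai

# Each plugin implements the same livekit.agents.stt.STT interface,
# so the surrounding pipeline code does not need to change.
stt_deepgram = deepgram.STT(model="nova-2-general")
stt_google = google.STT()                  # uses Google Cloud Speech-to-Text credentials
stt_openai = openai.STT(model="whisper-1")  # non-streaming; wrap in a StreamAdapter for streaming use
```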
m
@millions-hair-24956 Can you write code for this?
g
Certainly! Below is a Python code snippet demonstrating how to use the Deepgram STT plugin for transcriptions without using the VoicePipelineAgent. This example assumes you have already set up your LiveKit environment and have the necessary API keys.
```python
import asyncio

from livekit import rtc
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.stt import SpeechEventType
from livekit.plugins import deepgram


async def transcribe_audio(ctx: JobContext):
    # Connect to the room and wait for a participant to join
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    participant = await ctx.wait_for_participant()

    # Initialize Deepgram STT
    stt = deepgram.STT(model="nova-2-general", interim_results=True)

    # Create a streaming transcription session
    stt_stream = stt.stream()

    async def process_audio(track: rtc.Track):
        # rtc.AudioStream yields AudioFrameEvent objects; push the raw frame to the STT stream
        audio_stream = rtc.AudioStream(track)
        async for ev in audio_stream:
            stt_stream.push_frame(ev.frame)

    @ctx.room.on("track_subscribed")
    def on_track_subscribed(track: rtc.Track, publication: rtc.TrackPublication, participant: rtc.RemoteParticipant):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            asyncio.create_task(process_audio(track))

    # Consume transcription events as they arrive
    async for event in stt_stream:
        if event.type == SpeechEventType.FINAL_TRANSCRIPT:
            print("Final Transcript:", event.alternatives[0].text)


# The JobContext is supplied by the framework; run the worker via the CLI
# instead of constructing a JobContext manually.
if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=transcribe_audio))
```
This code connects to a LiveKit room, subscribes to audio tracks, and processes them using the Deepgram STT plugin, printing the final transcriptions to the console. It runs as an agent worker, so the JobContext is provided by the framework through cli.run_app rather than constructed manually. Would you like more details or modifications to this code?
Sources:
- https://docs.livekit.io/python/livekit/plugins/deepgram/index.html#livekit.plugins.deepgram.STT
- https://docs.livekit.io/python/livekit/plugins/deepgram/index.html#livekit.plugins.deepgram.SpeechStream
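Because the stream above is created with interim_results=True, it also emits INTERIM_TRANSCRIPT events before each final result. If you want live partial results, a small helper like the following sketch, using the same SpeechEventType enum, could replace the final-only loop:

```python
from livekit.agents.stt import SpeechEventType, SpeechStream


async def print_transcripts(stt_stream: SpeechStream) -> None:
    """Consume both interim and final events from an STT stream."""
    async for event in stt_stream:
        if event.type == SpeechEventType.INTERIM_TRANSCRIPT:
            # Partial hypothesis; may still be revised by later events
            print("Interim:", event.alternatives[0].text)
        elif event.type == SpeechEventType.FINAL_TRANSCRIPT:
            print("Final Transcript:", event.alternatives[0].text)
```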
m
@millions-hair-24956 use it in a multimodal agent setup
g
To integrate the Deepgram STT plugin into a MultimodalAgent setup, you can modify the agent to use the Deepgram STT for transcription while still leveraging the multimodal capabilities of the agent. Here's an example of how you can achieve this:
```python
import asyncio

from livekit import rtc
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.multimodal import MultimodalAgent
from livekit.agents.stt import SpeechEventType
from livekit.plugins import deepgram, openai


async def entrypoint(ctx: JobContext):
    # Connect to the room and wait for a participant to join
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    participant = await ctx.wait_for_participant()

    # Initialize Deepgram STT for the transcription side channel
    stt = deepgram.STT(model="nova-2-general", interim_results=True)

    # Initialize the OpenAI Realtime model used by the MultimodalAgent
    model = openai.realtime.RealtimeModel(
        instructions="You are a helpful assistant.",
        voice="shimmer",
        temperature=0.8,
        modalities=["audio", "text"],
    )

    # Create and start the MultimodalAgent
    assistant = MultimodalAgent(model=model)
    assistant.start(ctx.room, participant)

    # Create a streaming transcription session
    stt_stream = stt.stream()

    async def process_audio(track: rtc.Track):
        # rtc.AudioStream yields AudioFrameEvent objects; push the raw frame to the STT stream
        audio_stream = rtc.AudioStream(track)
        async for ev in audio_stream:
            stt_stream.push_frame(ev.frame)

    @ctx.room.on("track_subscribed")
    def on_track_subscribed(track: rtc.Track, publication: rtc.TrackPublication, participant: rtc.RemoteParticipant):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            asyncio.create_task(process_audio(track))

    # Consume transcription events as they arrive
    async for event in stt_stream:
        if event.type == SpeechEventType.FINAL_TRANSCRIPT:
            print("Final Transcript:", event.alternatives[0].text)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
This code sets up a MultimodalAgent using OpenAI's Realtime API and integrates Deepgram STT for transcription. The agent connects to a LiveKit room, subscribes to audio tracks, and processes them using Deepgram STT, while the MultimodalAgent handles the conversation flow. Would you like more details or modifications to this code?
Sources:
- Deepgram STT
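Note that the snippet above only prints the Deepgram transcripts; it does not change what the clients see. If the goal is for the client-side transcription log to show the Deepgram text rather than the realtime model's own transcript, the agents framework provides a transcription forwarder that publishes STT events to the room. The following is a rough sketch, assuming your agents version exposes livekit.agents.transcription.STTSegmentsForwarder; the helper name and wiring are illustrative:

```python
from livekit import rtc
from livekit.agents import transcription
from livekit.agents.stt import SpeechStream


async def forward_transcripts(room: rtc.Room, participant: rtc.RemoteParticipant,
                              track: rtc.Track, stt_stream: SpeechStream) -> None:
    # Publishes each STT event to the room so connected clients render the
    # Deepgram transcript instead of (or alongside) the realtime model's text.
    forwarder = transcription.STTSegmentsForwarder(
        room=room, participant=participant, track=track
    )
    async for event in stt_stream:
        forwarder.update(event)
```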
m
@millions-hair-24956 use openai stt
g
To use the OpenAI STT plugin within a MultimodalAgent setup, you can modify the agent to utilize OpenAI's STT capabilities. Here's an example of how you can achieve this:
```python
import asyncio

from livekit import rtc
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.multimodal import MultimodalAgent
from livekit.agents.stt import SpeechEventType, StreamAdapter
from livekit.plugins import openai, silero


async def entrypoint(ctx: JobContext):
    # Connect to the room and wait for a participant to join
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    participant = await ctx.wait_for_participant()

    # OpenAI's Whisper API ("whisper-1") does not support streaming, so wrap it
    # in a StreamAdapter with a VAD to obtain a streaming STT interface.
    # ("whisper-large-v3-turbo" is served by Groq, not the OpenAI API.)
    openai_stt = openai.STT(model="whisper-1")
    stt = StreamAdapter(stt=openai_stt, vad=silero.VAD.load())

    # Initialize the OpenAI Realtime model used by the MultimodalAgent
    model = openai.realtime.RealtimeModel(
        instructions="You are a helpful assistant.",
        voice="shimmer",
        temperature=0.8,
        modalities=["audio", "text"],
    )

    # Create and start the MultimodalAgent
    assistant = MultimodalAgent(model=model)
    assistant.start(ctx.room, participant)

    # Create a streaming transcription session
    stt_stream = stt.stream()

    async def process_audio(track: rtc.Track):
        # rtc.AudioStream yields AudioFrameEvent objects; push the raw frame to the STT stream
        audio_stream = rtc.AudioStream(track)
        async for ev in audio_stream:
            stt_stream.push_frame(ev.frame)

    @ctx.room.on("track_subscribed")
    def on_track_subscribed(track: rtc.Track, publication: rtc.TrackPublication, participant: rtc.RemoteParticipant):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            asyncio.create_task(process_audio(track))

    # Consume transcription events as they arrive
    async for event in stt_stream:
        if event.type == SpeechEventType.FINAL_TRANSCRIPT:
            print("Final Transcript:", event.alternatives[0].text)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
This code sets up a MultimodalAgent using OpenAI's Realtime API and integrates OpenAI's Whisper STT for transcription. Because the OpenAI Whisper API is not a streaming STT, it is wrapped in a StreamAdapter with Silero VAD so it can consume the live audio stream, while the MultimodalAgent handles the conversation flow. Would you like more details or modifications to this code?
Sources:
- OpenAI STT
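If you specifically want whisper-large-v3-turbo, note that this model is hosted by Groq rather than the OpenAI API. Under the assumption that your livekit-plugins-openai version exposes the Groq-backed constructor and that a GROQ_API_KEY is configured, the STT could be created like this instead (still wrapped in a StreamAdapter, since it is also non-streaming):

```python
from livekit.agents.stt import StreamAdapter
from livekit.plugins import openai, silero

# Groq-hosted Whisper through the OpenAI-compatible plugin; assumed available
# as openai.STT.with_groq in recent plugin versions.
groq_stt = openai.STT.with_groq(model="whisper-large-v3-turbo")
stt = StreamAdapter(stt=groq_stt, vad=silero.VAD.load())
```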