# ask-ai
r
Is it possible to use a model that takes audio as input and provides text as output (similar to gemini realtime)? Then use another tts?
t
Thanks for reaching out, Phanos. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Yes, LiveKit supports workflows where audio is transcribed to text with a Speech-to-Text (STT) model and the response is then converted back to speech with a Text-to-Speech (TTS) model. Here's how it works:

1. STT integration: incoming audio is transcribed in real time by an STT model (e.g., OpenAI Whisper).
2. Text processing: the transcript can be analyzed or transformed as needed, typically by an LLM.
3. TTS integration: the resulting text is synthesized back into speech by a TTS model (e.g., OpenAI, Google, or Azure).

This pipeline enables seamless voice interactions, such as real-time transcription and synthesized speech playback. Let me know if you need help with specific configurations or integrations!

Sources: Text-to-speech (TTS) integrations | LiveKit Docs | Building voice agents | LiveKit Docs | OpenAI STT integration guide | LiveKit Docs | Speech-to-text (STT) integrations | LiveKit Docs
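A minimal sketch of that STT → LLM → TTS pipeline using the LiveKit Agents Python SDK (1.x). The specific providers (Deepgram for STT, OpenAI for LLM and TTS, Silero for VAD), the model name, and the instructions string are illustrative assumptions; any supported plugin can be swapped in at each stage:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: agents.JobContext):
    # Each stage of the pipeline is an independent, swappable plugin.
    session = AgentSession(
        vad=silero.VAD.load(),                # voice activity detection for turn-taking
        stt=deepgram.STT(),                   # audio -> text (provider is an assumption)
        llm=openai.LLM(model="gpt-4o-mini"),  # text -> text
        tts=openai.TTS(),                     # text -> audio
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

Because each stage is a separate plugin, you can replace the STT, LLM, or TTS provider independently without touching the rest of the pipeline.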
r
I would like to skip the STT step. Therefore audio to llm directly and then TTS without STT
t
Yes, LiveKit lets you send audio directly to a model without an intermediate STT transcription step. This works through Realtime model integrations (e.g., OpenAI Realtime), which accept audio input and respond with speech natively. Here's a quick rundown:

1. Realtime models: these consume audio input directly, so no separate STT or TTS components are required.
2. Configuration: you can configure the Realtime model for audio-in, audio-out operation for low-latency voice interactions.
3. Direct processing: the model listens and responds in real time, generating speech without ever converting the audio to text first.

This approach reduces latency and preserves nuances of the original speech (tone, emotion, prosody) that a transcript would discard. I'm happy to help you set this up or walk through the configuration!
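A sketch of the realtime setup with the LiveKit Agents Python SDK, assuming the OpenAI Realtime plugin; the voice name and instructions are illustrative assumptions:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai


async def entrypoint(ctx: agents.JobContext):
    # A single speech-to-speech realtime model replaces the whole
    # STT -> LLM -> TTS chain: it consumes audio and emits speech directly.
    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="alloy"),  # voice name is an assumption
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

Note that no `stt` or `tts` arguments are passed: the realtime model handles both directions of the audio itself.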
r
I do want TTS though, to make my solution more modular.
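One way to get that modularity, sketched under the assumption that the LiveKit Agents SDK and its OpenAI Realtime plugin support restricting the realtime model to text output via a `modalities` parameter (as described in the OpenAI Realtime integration docs): the model still consumes audio directly, but produces text, which a separate, swappable TTS plugin then voices. The Cartesia TTS choice is illustrative:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, openai


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        # The realtime model hears audio directly (no STT step) but is
        # restricted to text output so an external TTS can speak for it.
        llm=openai.realtime.RealtimeModel(modalities=["text"]),
        # Any TTS plugin can be swapped in here (Cartesia is an assumption).
        tts=cartesia.TTS(),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

This keeps the audio-native understanding of a realtime model while leaving the voice layer modular: changing providers or voices only touches the `tts` argument.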