# ask-ai
**User:**
Can I use a Hugging Face STT model in LiveKit?
**Assistant:**
Thanks for reaching out, arpan. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
LiveKit doesn't directly support Hugging Face STT models just yet, but don't worry! You can create a custom STT node, or deploy the model to Baseten and use the Baseten STT plugin. Also, some providers we support, like Groq, offer Whisper models from the Hugging Face family. Could you let me know which Hugging Face STT model you're interested in? That way, I can give you more tailored advice.

Sources: Speech-to-text (STT) integrations | LiveKit Docs | Baseten STT integration guide | LiveKit Docs | livekit-plugins/livekit-plugins-groq/livekit/plugins/groq/models.py
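For instance, if you go the Groq route for Whisper, a minimal sketch might look like the following. This assumes you've installed the Groq plugin (`pip install "livekit-agents[groq]~=1.2"`) and set `GROQ_API_KEY`; the `whisper-large-v3` model name is just an example of what Groq serves:

```python
from livekit.agents import AgentSession
from livekit.plugins import groq

# Groq hosts Whisper-family models, so this gives you a Hugging Face-style
# Whisper STT without writing a custom node.
session = AgentSession(
    stt=groq.STT(model="whisper-large-v3", language="en"),
)
```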
**User:**
How do I use Baseten?
**Assistant:**
Hey there! If you want to use Baseten with LiveKit for STT integration, here's a friendly step-by-step guide:

1. Install the Baseten plugin: `pip install "livekit-agents[baseten]~=1.2"`
2. Set up authentication: pop your Baseten API key into your `.env` file as `BASETEN_API_KEY=<your-baseten-api-key>`
3. Deploy your model: get a WebSocket-compatible STT model up and running on Baseten (if you hit any snags, their support team is super helpful). Make sure your model endpoint URL looks like this: `wss://<your-model-id>.api.baseten.co/v1/websocket`
4. Integrate with LiveKit: use the Baseten STT plugin in your code. Here's a quick example:
```python
from livekit.agents import AgentSession
from livekit.plugins import baseten

stt_service = baseten.STT(
    model_endpoint="wss://your-model-id.api.baseten.co/v1/websocket",
    language="en"
)

session = AgentSession(
    stt=stt_service
)
```

If you have any questions or want a hand with any of these steps, just give me a shout. I'm here to help!

Sources: Baseten STT integration guide | LiveKit Docs
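And if it helps to see the bigger picture, here's a minimal sketch of a complete agent entrypoint using the Baseten STT. The LLM, TTS, and VAD choices below are just placeholders; swap in whichever providers you actually use (and install their plugin extras):

```python
from dotenv import load_dotenv

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import baseten, openai, silero

load_dotenv()  # loads BASETEN_API_KEY (and any other keys) from .env


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()
    session = AgentSession(
        # Baseten handles STT; the other components are placeholder choices
        stt=baseten.STT(
            model_endpoint="wss://your-model-id.api.baseten.co/v1/websocket",
            language="en",
        ),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=openai.TTS(),
        vad=silero.VAD.load(),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```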
**User:**
Can I use my custom model that I deployed to Ollama?
**Assistant:**
Hey there! You can totally use your custom model deployed to Ollama with LiveKit. Here's a simple way to get it going:

1. Install the OpenAI plugin: `pip install "livekit-agents[openai]~=1.2"`
2. Set up Ollama: make sure your Ollama server is up and running locally and that you can access it. Don't forget to swap out `<OLLAMA_BASE_URL>` with your actual server's base URL.
3. Integrate your custom model: pop your custom model's name into the `model` parameter, like this:
```python
from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.LLM.with_ollama(
        model="your-custom-model-name",  # Replace with your custom model
        base_url="<OLLAMA_BASE_URL>",
    )
)
```

If you hit any bumps or want a hand with the setup, just give me a shout. I'm here to help!

Sources: Ollama integration guide | LiveKit Docs | Large language model (LLM) integrations | LiveKit Docs | External data and RAG | LiveKit Docs
**User:**
Can I use any Ollama model, like a TTS model?
**Assistant:**
Just a heads-up: Ollama models aren't built for TTS (text-to-speech) in LiveKit. They're great for LLM (large language model) work like chat and tool use, but not for turning text into speech. If you're looking to add TTS in LiveKit, you can use supported providers like OpenAI TTS or other TTS integrations. You can even mix Ollama for your LLM needs with a separate TTS provider in the same agent. If you want, I'm here to help you get that set up. Just let me know!

Sources: Large language model (LLM) integrations | LiveKit Docs | pipeline-llm/ollama_llm.py | livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/__init__.py | Ollama integration guide | LiveKit Docs
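For example, here's a minimal sketch of that split, assuming Ollama's default local endpoint and OpenAI as the stand-in speech provider (any supported STT/TTS integration works in its place):

```python
from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    # Ollama serves the LLM through its OpenAI-compatible API
    llm=openai.LLM.with_ollama(
        model="your-custom-model-name",
        base_url="http://localhost:11434/v1",  # Ollama's default local endpoint
    ),
    # Speech goes through a supported provider; OpenAI here is just an example
    stt=openai.STT(),
    tts=openai.TTS(),
)
```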
**User:**
What if I use my own custom STT?
**Assistant:**
Hey there! You can totally use a custom STT model with LiveKit by creating your own custom STT class. Here's a quick rundown of how to do it:

1. Define a custom STT class: build a class that extends `livekit.agents.stt.STT`. You'll want to implement methods for batch recognition (`_recognize_impl`) and streaming (`stream`).
2. Handle audio processing: convert your audio buffers into the format your model needs (like WAV) and send them over to your API endpoint.
3. Integrate with your API: connect via HTTP POST or WebSocket to work with your custom STT model. If you're using WebSocket, you'll manage the connection and handle transcription results as they come in, in real time.
4. Configure your agent: plug your custom STT class into an `AgentSession` like this:
```python
from livekit.agents import AgentSession

session = AgentSession(
    stt=CustomSTT(  # your custom STT class (full implementation shown below)
        model="your-model",
        api_key="your-api-key",
        language="en"
    )
)
```

If you'd like, I'm happy to share some detailed code examples or help you out with any specific part of this setup. Just let me know!

Sources: Speech-to-text (STT) integrations | LiveKit Docs | livekit-plugins/livekit-plugins-baseten/livekit/plugins/baseten/stt.py | Baseten STT integration guide | LiveKit Docs
**User:**
Can you share the full code for the custom STT class?
**Assistant:**
Here’s a full example of a custom STT class implementation for LiveKit that you can use:
```python
from __future__ import annotations

import asyncio
import dataclasses
import json
import os
from dataclasses import dataclass

import aiohttp

from livekit import rtc
from livekit.agents import (
    DEFAULT_API_CONNECT_OPTIONS,
    APIConnectOptions,
    stt,
    utils,
)
from livekit.agents.types import NOT_GIVEN, NotGivenOr
from livekit.agents.utils import is_given

@dataclass
class CustomSTTOptions:
    model: str
    language: str
    api_key: str
    base_url: str
    sample_rate: int = 16000

class CustomSTT(stt.STT):
    def __init__(
        self,
        *,
        model: str = "your-model",
        language: str = "en",
        api_key: str | None = None,
        base_url: str = "https://your-api.com",
        sample_rate: int = 16000,
        http_session: aiohttp.ClientSession | None = None,
    ) -> None:
        super().__init__(
            capabilities=stt.STTCapabilities(
                streaming=True,  # Adjust based on your model's features
                interim_results=True  # Adjust based on your model's features
            )
        )
        
        self._opts = CustomSTTOptions(
            model=model,
            language=language,
            api_key=api_key or os.environ.get("YOUR_API_KEY"),
            base_url=base_url,
            sample_rate=sample_rate,
        )
        self._session = http_session

    def _ensure_session(self) -> aiohttp.ClientSession:
        # Lazily create (or reuse) an aiohttp session for API calls
        if self._session is None:
            self._session = utils.http_context.http_session()
        return self._session

    def _sanitize_options(self, *, language: NotGivenOr[str] = NOT_GIVEN) -> CustomSTTOptions:
        # Copy the configured options, overriding the language if one was given
        opts = dataclasses.replace(self._opts)
        if is_given(language):
            opts.language = language
        return opts

    async def _recognize_impl(
        self,
        buffer: utils.AudioBuffer,
        *,
        language: NotGivenOr[str] = NOT_GIVEN,
        conn_options: APIConnectOptions,
    ) -> stt.SpeechEvent:
        # Convert audio buffer to the format your API expects
        audio_data = rtc.combine_audio_frames(buffer).to_wav_bytes()
        
        # Call your custom STT API
        async with self._ensure_session().post(
            url=f"{self._opts.base_url}/transcribe",
            data=audio_data,
            headers={
                "Authorization": f"Bearer {self._opts.api_key}",
                "Content-Type": "audio/wav",
            },
            timeout=aiohttp.ClientTimeout(total=30),
        ) as response:
            result = await response.json()
            
            # Return a nicely formatted speech event
            return stt.SpeechEvent(
                type=stt.SpeechEventType.FINAL_TRANSCRIPT,
                alternatives=[
                    stt.SpeechData(
                        text=result["text"],
                        language=language if is_given(language) else self._opts.language,
                        confidence=result.get("confidence", 1.0),
                        start_time=0.0,
                        end_time=result.get("duration", 0.0),
                    )
                ]
            )

    def stream(
        self,
        *,
        language: NotGivenOr[str] = NOT_GIVEN,
        conn_options: APIConnectOptions = DEFAULT_API_CONNECT_OPTIONS,
    ) -> SpeechStream:
        config = self._sanitize_options(language=language)
        stream = SpeechStream(
            stt=self,
            opts=config,
            conn_options=conn_options,
        )
        return stream

class SpeechStream(stt.SpeechStream):
    def __init__(
        self,
        *,
        stt: CustomSTT,
        opts: CustomSTTOptions,
        conn_options: APIConnectOptions,
    ) -> None:
        super().__init__(
            stt=stt, 
            conn_options=conn_options, 
            sample_rate=opts.sample_rate
        )
        self._opts = opts
        self._custom_stt = stt  # keep a reference so we can reuse its HTTP session

    async def _run(self) -> None:
        """Main loop for streaming transcription."""
        
        async def send_task(ws: aiohttp.ClientWebSocketResponse) -> None:
            # Process audio frames and send to your API
            audio_bstream = utils.audio.AudioByteStream(
                sample_rate=self._opts.sample_rate,
                num_channels=1,
                samples_per_channel=self._opts.sample_rate // 20,  # 50ms chunks
            )
            
            async for data in self._input_ch:
                frames: list[rtc.AudioFrame] = []
                if isinstance(data, rtc.AudioFrame):
                    frames.extend(audio_bstream.write(data.data.tobytes()))
                elif isinstance(data, self._FlushSentinel):
                    frames.extend(audio_bstream.flush())
                
                for frame in frames:
                    # Send audio data to your WebSocket API
                    await ws.send_bytes(frame.data.tobytes())
        
        async def recv_task(ws: aiohttp.ClientWebSocketResponse) -> None:
            # Receive and handle responses from your API
            while True:
                msg = await ws.receive()
                if msg.type in (
                    aiohttp.WSMsgType.CLOSE,
                    aiohttp.WSMsgType.CLOSED,
                    aiohttp.WSMsgType.ERROR,
                ):
                    break
                if msg.type == aiohttp.WSMsgType.TEXT:
                    data = json.loads(msg.data)
                    self._process_stream_event(data)

        # Connect and manage the WebSocket for the lifetime of the stream
        ws = await self._connect_ws()
        try:
            await asyncio.gather(
                asyncio.create_task(send_task(ws)),
                asyncio.create_task(recv_task(ws)),
            )
        finally:
            await ws.close()

    async def _connect_ws(self) -> aiohttp.ClientWebSocketResponse:
        # Open a WebSocket connection to your streaming endpoint
        # (adjust the URL path to whatever your API actually exposes)
        ws_url = self._opts.base_url.replace("http", "ws", 1) + "/stream"
        return await self._custom_stt._ensure_session().ws_connect(
            ws_url,
            headers={"Authorization": f"Bearer {self._opts.api_key}"},
        )

    def _process_stream_event(self, data: dict) -> None:
        """Handle incoming WebSocket messages from your STT API."""
        
        text = data.get("text", "")
        is_final = data.get("is_final", False)
        confidence = data.get("confidence", 1.0)
        
        if not text:
            return
        
        speech_data = stt.SpeechData(
            language=self._opts.language,
            text=text,
            confidence=confidence,
            start_time=data.get("start_time", 0.0),
            end_time=data.get("end_time", 0.0),
        )
        
        event_type = (
            stt.SpeechEventType.FINAL_TRANSCRIPT 
            if is_final 
            else stt.SpeechEventType.INTERIM_TRANSCRIPT
        )
        
        event = stt.SpeechEvent(
            type=event_type,
            alternatives=[speech_data],
        )
        
        self._event_ch.send_nowait(event)
```

This should give you a solid foundation for both batch and streaming transcription with your custom STT class. If you want me to walk you through any part or if you have questions, just let me know. I'm happy to help!
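One more tip: if your model only supports batch transcription, you can set `streaming=False` in the capabilities, skip the `SpeechStream` class entirely, and let a VAD segment the audio for you. Here's a minimal sketch, assuming the Silero VAD plugin and the framework's `stt.StreamAdapter` wrapper:

```python
from livekit.agents import AgentSession, stt
from livekit.plugins import silero

# Wrap a batch-only CustomSTT so it behaves like a streaming STT:
# the VAD segments incoming audio and each segment goes to _recognize_impl.
batch_stt = CustomSTT(model="your-model", api_key="your-api-key", language="en")
adapted = stt.StreamAdapter(stt=batch_stt, vad=silero.VAD.load())

session = AgentSession(stt=adapted)
```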