polite-dog-12152 (08/20/2025, 8:20 AM)
tall-airline-16756 (08/20/2025, 8:20 AM)
tall-airline-16756 (08/20/2025, 8:21 AM)
polite-dog-12152 (08/20/2025, 8:51 AM)
tall-airline-16756 (08/20/2025, 8:51 AM)
1. Install the Plugin: Add the Baseten plugin to your project:
pip install "livekit-agents[baseten]~=1.2"
2. Set Up Authentication: Pop your Baseten API key into the .env file like this:
BASETEN_API_KEY=<your-baseten-api-key>
3. Deploy Your Model: Get a websocket-compatible STT model up and running on Baseten. If you hit any snags, their support team is super helpful! Make sure your model endpoint URL looks like this:
wss://<your-model-id>.api.baseten.co/v1/websocket
4. Integrate with LiveKit: Now, use the Baseten STT plugin in your code. Here’s a quick example:
from livekit.agents import AgentSession
from livekit.plugins import baseten

stt_service = baseten.STT(
    model_endpoint="wss://your-model-id.api.baseten.co/v1/websocket",
    language="en",
)

session = AgentSession(
    stt=stt_service,
)
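And if it helps to see the bigger picture, here's a rough sketch of how that session could sit inside a full agent entrypoint. The entrypoint layout, the Agent instructions, and the python-dotenv call that loads BASETEN_API_KEY are my assumptions rather than part of the Baseten guide, and in practice you'd pair the STT with an LLM and TTS:

from dotenv import load_dotenv  # assumption: python-dotenv is installed

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import baseten

load_dotenv()  # picks up BASETEN_API_KEY from your .env file


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        stt=baseten.STT(
            model_endpoint="wss://your-model-id.api.baseten.co/v1/websocket",
            language="en",
        ),
        # llm=..., tts=...  # add the rest of your pipeline here
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )
    await ctx.connect()


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))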
If you have any questions or want a hand with any of these steps, just give me a shout—I’m here to help!
Sources: Baseten STT integration guide | LiveKit Docs

polite-dog-12152 (08/20/2025, 9:07 AM)
tall-airline-16756 (08/20/2025, 9:07 AM)
1. Install the Plugin: Ollama works through the OpenAI-compatible plugin:
pip install "livekit-agents[openai]~=1.2"
2. Set Up Ollama: Make sure your Ollama server is up and running locally and that you can access it. Don't forget to swap out <OLLAMA_BASE_URL> with your actual server's base URL (typically http://localhost:11434/v1 for a local install).
3. Integrate Your Custom Model: Pop your custom model's name into the model parameter. Like this:
from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.LLM.with_ollama(
        model="your-custom-model-name",  # Replace with your custom model
        base_url="<OLLAMA_BASE_URL>",    # e.g. http://localhost:11434/v1
    ),
)
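Before wiring the model into your agent, it can be worth a quick sanity check that the model name and base URL resolve. Here's a small sketch using the standard openai Python package against Ollama's OpenAI-compatible endpoint; the localhost URL and model name are placeholders:

from openai import OpenAI

# Ollama exposes an OpenAI-compatible API; the API key is ignored but the client requires one
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="your-custom-model-name",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)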
If you hit any bumps or want a hand with the setup, just give me a shout—I’m here to help!
Sources: Ollama integration guide | LiveKit Docs | Large language model (LLM) integrations | LiveKit Docs | External data and RAG | LiveKit Docs

polite-dog-12152 (08/20/2025, 9:13 AM)
tall-airline-16756 (08/20/2025, 9:14 AM)
polite-dog-12152 (08/20/2025, 9:15 AM)
tall-airline-16756 (08/20/2025, 9:17 AM)
1. Create a Custom STT Class: Subclass livekit.agents.stt.STT. You'll want to implement methods for batch recognition (_recognize_impl) and streaming (stream).
2. Handle Audio Processing: Make sure to convert your audio buffers into the format your model needs (like WAV) and send them over to your API endpoint.
3. Integrate with Your API: Connect using HTTP POST or WebSocket to work with your custom STT model. If you’re using WebSocket, you’ll manage the connections and handle transcription results as they come in, in real-time.
4. Configure Your Agent: Plug your custom STT class into an AgentSession like this:
from livekit.agents import AgentSession

session = AgentSession(
    stt=CustomSTT(
        model="your-model",
        api_key="your-api-key",
        language="en",
    )
)
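One extra option worth knowing about: if your model only supports batch recognition (streaming=False), you can still use it in a real-time pipeline by wrapping it in LiveKit's VAD-based stream adapter. Here's a minimal sketch, assuming the CustomSTT class from the snippet above and the Silero VAD plugin (livekit-agents[silero]):

from livekit.agents import AgentSession, stt
from livekit.plugins import silero

# Batch-only custom STT, wrapped so it behaves like a streaming STT
batch_stt = CustomSTT(model="your-model", api_key="your-api-key", language="en")
adapter = stt.StreamAdapter(stt=batch_stt, vad=silero.VAD.load())

session = AgentSession(stt=adapter)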
If you’d like, I’m happy to share some detailed code examples or help you out with any specific part of this setup—just let me know!
Sources: Speech-to-text (STT) integrations | LiveKit Docs | livekit-plugins/livekit-plugins-baseten/livekit/plugins/baseten/stt.py | Baseten STT integration guide | LiveKit Docs

polite-dog-12152 (08/20/2025, 9:18 AM)
tall-airline-16756 (08/20/2025, 9:19 AM)
Here's a detailed example you can adapt for a custom STT class:

from __future__ import annotations

import asyncio
import json
import os
from dataclasses import dataclass, replace

import aiohttp

from livekit import rtc
from livekit.agents import (
    DEFAULT_API_CONNECT_OPTIONS,
    APIConnectOptions,
    stt,
    utils,
)
from livekit.agents.types import NOT_GIVEN, NotGivenOr
from livekit.agents.utils import is_given


@dataclass
class CustomSTTOptions:
    model: str
    language: str
    api_key: str
    base_url: str
    sample_rate: int = 16000


class CustomSTT(stt.STT):
    def __init__(
        self,
        *,
        model: str = "your-model",
        language: str = "en",
        api_key: str | None = None,
        base_url: str = "https://your-api.com",
        sample_rate: int = 16000,
        http_session: aiohttp.ClientSession | None = None,
    ) -> None:
        super().__init__(
            capabilities=stt.STTCapabilities(
                streaming=True,  # Adjust based on your model's features
                interim_results=True,  # Adjust based on your model's features
            )
        )
        self._opts = CustomSTTOptions(
            model=model,
            language=language,
            api_key=api_key or os.environ.get("YOUR_API_KEY"),
            base_url=base_url,
            sample_rate=sample_rate,
        )
        self._session = http_session

    def _ensure_session(self) -> aiohttp.ClientSession:
        # Reuse the provided HTTP session, or fall back to the shared session
        # from the agents runtime (available inside a job context)
        if not self._session:
            self._session = utils.http_context.http_session()
        return self._session

    def _sanitize_options(self, *, language: NotGivenOr[str] = NOT_GIVEN) -> CustomSTTOptions:
        # Copy the default options, overriding the language when one is given
        opts = replace(self._opts)
        if is_given(language):
            opts.language = language
        return opts

    async def _recognize_impl(
        self,
        buffer: utils.AudioBuffer,
        *,
        language: NotGivenOr[str] = NOT_GIVEN,
        conn_options: APIConnectOptions,
    ) -> stt.SpeechEvent:
        # Convert the audio buffer to the format your API expects (WAV here)
        audio_data = rtc.combine_audio_frames(buffer).to_wav_bytes()

        # Call your custom STT API (the /transcribe path is a placeholder)
        async with self._ensure_session().post(
            url=f"{self._opts.base_url}/transcribe",
            data=audio_data,
            headers={
                "Authorization": f"Bearer {self._opts.api_key}",
                "Content-Type": "audio/wav",
            },
            timeout=aiohttp.ClientTimeout(total=30),
        ) as response:
            result = await response.json()

            # Return a formatted speech event
            return stt.SpeechEvent(
                type=stt.SpeechEventType.FINAL_TRANSCRIPT,
                alternatives=[
                    stt.SpeechData(
                        text=result["text"],
                        language=language if is_given(language) else self._opts.language,
                        confidence=result.get("confidence", 1.0),
                        start_time=0.0,
                        end_time=result.get("duration", 0.0),
                    )
                ],
            )

    def stream(
        self,
        *,
        language: NotGivenOr[str] = NOT_GIVEN,
        conn_options: APIConnectOptions = DEFAULT_API_CONNECT_OPTIONS,
    ) -> SpeechStream:
        config = self._sanitize_options(language=language)
        return SpeechStream(
            stt=self,
            opts=config,
            conn_options=conn_options,
        )


class SpeechStream(stt.SpeechStream):
    def __init__(
        self,
        *,
        stt: CustomSTT,
        opts: CustomSTTOptions,
        conn_options: APIConnectOptions,
    ) -> None:
        super().__init__(
            stt=stt,
            conn_options=conn_options,
            sample_rate=opts.sample_rate,
        )
        self._opts = opts

    async def _connect_ws(self) -> aiohttp.ClientWebSocketResponse:
        # Open a WebSocket to your streaming endpoint; the /stream path and
        # bearer-token auth are placeholders for whatever your API expects
        session = utils.http_context.http_session()
        return await session.ws_connect(
            f"{self._opts.base_url}/stream",
            headers={"Authorization": f"Bearer {self._opts.api_key}"},
        )

    async def _run(self) -> None:
        """Main loop for streaming transcription."""

        async def send_task(ws: aiohttp.ClientWebSocketResponse) -> None:
            # Chunk incoming audio frames and send them to your API
            audio_bstream = utils.audio.AudioByteStream(
                sample_rate=self._opts.sample_rate,
                num_channels=1,
                samples_per_channel=self._opts.sample_rate // 20,  # 50ms chunks
            )
            async for data in self._input_ch:
                frames: list[rtc.AudioFrame] = []
                if isinstance(data, rtc.AudioFrame):
                    frames.extend(audio_bstream.write(data.data.tobytes()))
                elif isinstance(data, self._FlushSentinel):
                    frames.extend(audio_bstream.flush())
                for frame in frames:
                    # Send raw audio data over the WebSocket
                    await ws.send_bytes(frame.data.tobytes())

        async def recv_task(ws: aiohttp.ClientWebSocketResponse) -> None:
            # Receive and handle transcription results from your API
            while True:
                msg = await ws.receive()
                if msg.type in (
                    aiohttp.WSMsgType.CLOSE,
                    aiohttp.WSMsgType.CLOSED,
                    aiohttp.WSMsgType.CLOSING,
                ):
                    break
                if msg.type == aiohttp.WSMsgType.TEXT:
                    data = json.loads(msg.data)
                    self._process_stream_event(data)

        # Connect and manage the WebSocket
        ws = await self._connect_ws()
        try:
            tasks = [
                asyncio.create_task(send_task(ws)),
                asyncio.create_task(recv_task(ws)),
            ]
            await asyncio.gather(*tasks)
        finally:
            await ws.close()

    def _process_stream_event(self, data: dict) -> None:
        """Handle incoming WebSocket messages from your STT API."""
        text = data.get("text", "")
        is_final = data.get("is_final", False)
        confidence = data.get("confidence", 1.0)
        if not text:
            return

        speech_data = stt.SpeechData(
            language=self._opts.language,
            text=text,
            confidence=confidence,
            start_time=data.get("start_time", 0.0),
            end_time=data.get("end_time", 0.0),
        )
        event_type = (
            stt.SpeechEventType.FINAL_TRANSCRIPT
            if is_final
            else stt.SpeechEventType.INTERIM_TRANSCRIPT
        )
        event = stt.SpeechEvent(
            type=event_type,
            alternatives=[speech_data],
        )
        self._event_ch.send_nowait(event)
This should give you a solid foundation for both batch and streaming transcription with your custom STT class. If you want me to walk you through any part or if you have questions, just let me know—I’m happy to help!
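If you'd like to exercise the batch path outside of an agent session first, here's a rough sketch. The sample.wav file, the /transcribe endpoint, and the response fields all follow the assumptions baked into the class above:

import asyncio
import wave

import aiohttp
from livekit import rtc


async def main() -> None:
    async with aiohttp.ClientSession() as http_session:
        stt_impl = CustomSTT(
            api_key="your-api-key",
            base_url="https://your-api.com",
            http_session=http_session,  # avoids needing the agents job context
        )
        # Build a single AudioFrame from a local WAV file (16-bit PCM assumed)
        with wave.open("sample.wav", "rb") as wav:
            frame = rtc.AudioFrame(
                data=wav.readframes(wav.getnframes()),
                sample_rate=wav.getframerate(),
                num_channels=wav.getnchannels(),
                samples_per_channel=wav.getnframes(),
            )
        event = await stt_impl.recognize(buffer=frame)
        print(event.alternatives[0].text)


asyncio.run(main())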