# ask-ai
b
How do I split transcripts into smaller chunks, rather than having a single wall of text?
t
Thanks for reaching out, Tristan. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Do you want chunked transcripts to be delivered in real time as speech is processed, or are you looking to split completed transcripts after the full text is available? This will help determine if you should use built-in segment/partial delivery, tokenization utilities, or custom text processing logic. Sources: .github/next-release/changeset-6ee37a92.md | Voice pipeline nodes | LiveKit Docs | Speechmatics STT integration guide | LiveKit Docs | Transcriptions | LiveKit Docs | WordTokenizer | LiveKit Agents | splitWords | LiveKit Agents
b
I want to split both partial and complete segments, if the segment is over a certain length.
t
Hey there! Let me break down how you can chunk your transcripts – it's straightforward once you know the options. 😊
• Pipeline node customization: Override the handler that receives incoming segments and split any segment that exceeds a size limit. This works for both real-time (partial) and finalized text, so you're covered.
• Chunking example: A simple character-based splitter looks like this:

```python
def _text_to_chunks(text: str, size: int = 30) -> list[str]:
    # size is the maximum chunk length, in characters
    return [text[i:i + size] for i in range(0, len(text), size)]
```

You can adapt the size and the splitting logic to fit your needs.
• Frontend handling: If you're working on the client side, listen for segment events and split the text before you display it.
• Word tokenizer utility: LiveKit also ships a `WordTokenizer` utility that handles real-time word segmentation – definitely worth checking out.
Bottom line: whether your segments are partial or final, you have full control over how they're chunked. 🌟
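One caveat with the character-slice approach mentioned above: it can cut words in half. A minimal sketch of a word-boundary-aware variant, in plain Python with no LiveKit dependency (the function name and `max_chars` default are illustrative, not a LiveKit API):

```python
def split_words_into_chunks(text: str, max_chars: int = 30) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on word boundaries.

    A single word longer than max_chars still becomes its own (oversized) chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk already has words
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, `split_words_into_chunks("the quick brown fox jumps over the lazy dog", 10)` yields five chunks, each at most 10 characters, with no word split across a boundary.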
b
Where does the _text_to_chunks function need to be?
t
Hey there! Let me break down where to place your custom `_text_to_chunks` function so it fits smoothly into your transcription processing.
• In the TTS (text-to-speech) node: Override the `tts_node` in your Agent or pipeline and call `_text_to_chunks` before sending text downstream. That way you handle both partial and complete segments as they come in.
• Before pushing to any downstream segment forwarder: If you're not using TTS, no worries! Put `_text_to_chunks` into the handler or processor that receives transcription segments, so the text is split before you emit or store it. This approach is flexible and works with all kinds of transcript content.
The golden rule: integrate your chunking right after you receive or generate each text segment, but before you send it on for synthesis, display, or storage. Want more details? Check out the documentation for TTS node customization and chunking patterns. Sources: livekit-plugins/livekit-plugins-aws/livekit/plugins/aws/stt.py | Voice pipeline nodes | LiveKit Docs
b
It's actually the STT I want to affect, not the TTS.
t
Hey there! Let me help you tackle those long speech-to-text segments. Here are a few ways to break them into more manageable pieces:
• BufferedTokenStream: Buffers text and flushes it according to your custom chunking logic while the audio is being transcribed – it takes care of the splitting for you.
• Custom `stt_node` override: If you're using LiveKit Agents, override the agent's `stt_node`, inspect each transcript segment, and split it if it exceeds your length limit.
• Direct event handling: Alternatively, intercept each STT event as it arrives and split both interim and final transcripts before they move along.
Here's how an `stt_node` override might look:
```python
class CustomChunkingAgent(Agent):
    async def stt_node(self, audio, model_settings):
        max_length = 200  # maximum segment length, in characters
        async for event in Agent.default.stt_node(self, audio, model_settings):
            text = event.alternatives[0].text
            if len(text) > max_length:
                for chunk in _text_to_chunks(text, max_length):
                    # Create and yield a new STT event for each chunk
                    yield ...
            else:
                yield event
```
The cool thing about these approaches is that you can catch those long text segments early. This means everything downstream gets nice, bite-sized pieces of text – whether it's a draft or the final version. If you want to dive deeper, I've got some great references you can check out. Sources: livekit-plugins/livekit-plugins-aws/livekit/plugins/aws/stt.py | Speech-to-text (STT) integrations | LiveKit Docs | Voice pipeline nodes | LiveKit Docs
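The event-splitting loop in the example above can be exercised without LiveKit by standing in minimal event types. Note the `SpeechEvent`/`SpeechData` classes below only mimic the shape the real livekit-agents types expose (`alternatives[0].text`); they are not the actual library classes, and the `split_event` helper is illustrative:

```python
from dataclasses import dataclass, replace


@dataclass
class SpeechData:
    # Stand-in for the real STT alternative type (only the .text field).
    text: str


@dataclass
class SpeechEvent:
    # Stand-in for the real STT event type.
    is_final: bool
    alternatives: list


def split_event(event: SpeechEvent, max_length: int) -> list[SpeechEvent]:
    """Split one STT event into several if its top alternative is too long.

    Each resulting event keeps the original event's metadata (here, is_final)
    and carries one chunk of the original text.
    """
    text = event.alternatives[0].text
    if len(text) <= max_length:
        return [event]
    return [
        replace(event, alternatives=[SpeechData(text=text[i:i + max_length])])
        for i in range(0, len(text), max_length)
    ]
```

With the real library you would build proper `SpeechEvent` objects in place of the `replace(...)` call, but the control flow – pass short events through, fan long ones out into several – is the same.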