# ask-ai
b
How do I split transcripts into smaller chunks, rather than having a single wall of text?
t
Thanks for reaching out, Tristan. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Do you want chunked transcripts to be delivered in real time as speech is processed, or are you looking to split completed transcripts after the full text is available? This will help determine if you should use built-in segment/partial delivery, tokenization utilities, or custom text processing logic. Sources: .github/next-release/changeset-6ee37a92.md | Voice pipeline nodes | LiveKit Docs | Speechmatics STT integration guide | LiveKit Docs | Transcriptions | LiveKit Docs | WordTokenizer | LiveKit Agents | splitWords | LiveKit Agents
b
I want to split both partial and complete segments, if the segment is over a certain length.
t
Hey there! Let me break down how you can chunk your transcripts – it's straightforward once you know the options. 😊
• Pipeline node customization: Override the handler that receives incoming segments and split any segment that exceeds a size limit. This works for both real-time (partial) and finalized text, so you're covered.
• Chunking example: A simple character-based splitter looks like this:

```python
def _text_to_chunks(text: str, size: int = 30) -> list[str]:
    # size is the maximum chunk length, in characters
    return [text[i:i + size] for i in range(0, len(text), size)]
```

You can adapt the size and the splitting logic to fit your needs.
• Frontend handling: If you're working on the client side, listen for segment events and split the text before you display it.
• Word tokenizer utility: LiveKit also ships a `WordTokenizer` utility that handles real-time word segmentation – definitely worth checking out.
Bottom line: whether your segments are partial or final, you have full control over how they're chunked. 🌟
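One caveat with the character-slice approach mentioned above: it can cut words in half. A minimal sketch of a word-boundary-aware variant, in plain Python with no LiveKit dependency (the function name and `max_chars` default are illustrative, not a LiveKit API):

```python
def split_words_into_chunks(text: str, max_chars: int = 30) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on word boundaries.

    A single word longer than max_chars still becomes its own (oversized) chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk already has words
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, `split_words_into_chunks("the quick brown fox jumps over the lazy dog", 10)` yields five chunks, each at most 10 characters, with no word split across a boundary.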
b
Where does the _text_to_chunks function need to be?
t
Hey there! Let me break down where to place your custom `_text_to_chunks` function so it fits smoothly into your transcription processing.
• In the TTS (text-to-speech) node: Override the `tts_node` in your Agent or pipeline and call `_text_to_chunks` before sending text downstream. That way you handle both partial and complete segments as they come in.
• Before pushing to any downstream segment forwarder: If you're not using TTS, no worries! Put `_text_to_chunks` into the handler or processor that receives transcription segments, so the text is split before you emit or store it. This approach is flexible and works with all kinds of transcript content.
The golden rule: integrate your chunking right after you receive or generate each text segment, but before you send it on for synthesis, display, or storage. Want more details? Check out the documentation for TTS node customization and chunking patterns. Sources: livekit-plugins/livekit-plugins-aws/livekit/plugins/aws/stt.py | Voice pipeline nodes | LiveKit Docs
b
It's actually the STT I want to affect, not the TTS.
t
Hey there! Let me help you tackle those long speech-to-text segments. Here are a few ways to break them into more manageable pieces:
• BufferedTokenStream: Buffers text and flushes it according to your custom chunking logic while the audio is being transcribed – it takes care of the splitting for you.
• Custom `stt_node` override: If you're using LiveKit Agents, override the agent's `stt_node`, inspect each transcript segment, and split it if it exceeds your length limit.
• Direct event handling: Alternatively, intercept each STT event as it arrives and split both interim and final transcripts before they move along.
Here's how an `stt_node` override might look:
```python
class CustomChunkingAgent(Agent):
    async def stt_node(self, audio, model_settings):
        max_length = 200  # maximum segment length, in characters
        async for event in Agent.default.stt_node(self, audio, model_settings):
            text = event.alternatives[0].text
            if len(text) > max_length:
                for chunk in _text_to_chunks(text, max_length):
                    # Create and yield a new STT event for each chunk
                    yield ...
            else:
                yield event
```
The cool thing about these approaches is that you can catch those long text segments early. This means everything downstream gets nice, bite-sized pieces of text – whether it's a draft or the final version. If you want to dive deeper, I've got some great references you can check out. Sources: livekit-plugins/livekit-plugins-aws/livekit/plugins/aws/stt.py | Speech-to-text (STT) integrations | LiveKit Docs | Voice pipeline nodes | LiveKit Docs
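The event-splitting loop in the example above can be exercised without LiveKit by standing in minimal event types. Note the `SpeechEvent`/`SpeechData` classes below only mimic the shape the real livekit-agents types expose (`alternatives[0].text`); they are not the actual library classes, and the `split_event` helper is illustrative:

```python
from dataclasses import dataclass, replace


@dataclass
class SpeechData:
    # Stand-in for the real STT alternative type (only the .text field).
    text: str


@dataclass
class SpeechEvent:
    # Stand-in for the real STT event type.
    is_final: bool
    alternatives: list


def split_event(event: SpeechEvent, max_length: int) -> list[SpeechEvent]:
    """Split one STT event into several if its top alternative is too long.

    Each resulting event keeps the original event's metadata (here, is_final)
    and carries one chunk of the original text.
    """
    text = event.alternatives[0].text
    if len(text) <= max_length:
        return [event]
    return [
        replace(event, alternatives=[SpeechData(text=text[i:i + max_length])])
        for i in range(0, len(text), max_length)
    ]
```

With the real library you would build proper `SpeechEvent` objects in place of the `replace(...)` call, but the control flow – pass short events through, fan long ones out into several – is the same.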