steep-balloon-41261
02/23/2025, 3:11 PM

gray-monkey-83525
02/23/2025, 3:11 PM
You can use the prewarm function to load models and resources before the session starts, as suggested in the LiveKit documentation. Additionally, ensure that your server is geographically close to the model servers to minimize network latency, as mentioned in LiveKit's deployment recommendations.
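As a rough sketch of that prewarm pattern (assuming the livekit-agents Python package and the Silero VAD plugin; the names follow the LiveKit examples but may need adapting to your setup), heavy resources are loaded once per worker process and then reused in each session:

from livekit.agents import JobContext, JobProcess, WorkerOptions, cli
from livekit.plugins import silero

def prewarm(proc: JobProcess):
    # Load heavy resources once per worker process, before any job is accepted.
    proc.userdata["vad"] = silero.VAD.load()

async def entrypoint(ctx: JobContext):
    # Reuse the preloaded model instead of loading it at session start,
    # e.g. pass it into VoicePipelineAgent(vad=vad, ...).
    vad = ctx.proc.userdata["vad"]
    ...

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))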
If you have already implemented these steps and still experience latency, you might want to explore using a more efficient model or optimizing the prompt size sent to the LLM, as larger prompts can increase processing time. Also, consider using streaming responses to get the first tokens faster, as discussed in LiveKit's Slack channel.
Would you like more detailed guidance on any specific aspect?
Sources:
- LiveKit Slack discussion
- LiveKit's deployment recommendations

gray-monkey-83525
02/23/2025, 3:12 PM
You can use the before_llm_cb callback to truncate or adjust the chat context before it is sent to the LLM. This can help reduce the number of tokens processed, thereby decreasing latency. According to the LiveKit documentation, you can implement a function to truncate the context if it exceeds a certain length:
from livekit.agents import llm
from livekit.agents.pipeline import VoicePipelineAgent

async def truncate_context(assistant: VoicePipelineAgent, chat_ctx: llm.ChatContext):
    # Keep only the 15 most recent messages so the prompt stays small.
    if len(chat_ctx.messages) > 15:
        chat_ctx.messages = chat_ctx.messages[-15:]

agent = VoicePipelineAgent(
    ...
    before_llm_cb=truncate_context,
)
Would you like more detailed guidance on implementing this or any other aspect?
Sources:
- LiveKit VoicePipelineAgent Documentation

gray-monkey-83525
02/23/2025, 3:15 PM
If you are already using the before_llm_cb callback to manage the prompt size, consider further optimizing the context by focusing on the most relevant parts of the conversation. You can prioritize recent messages or those containing key information. Additionally, ensure that your callback processes and modifies the context efficiently so it does not itself add latency.
Another approach is to use a more compact representation of the conversation context, such as summarizing previous exchanges or removing redundant information. This can help reduce the token count without losing essential context.
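As a loose illustration (not taken from the LiveKit docs; the cutoff, helper name, and filtering rules are hypothetical), a before_llm_cb could keep the system prompt, drop empty messages, and retain only the most recent turns:

from livekit.agents import llm
from livekit.agents.pipeline import VoicePipelineAgent

MAX_RECENT = 10  # illustrative cutoff; tune for your model and latency budget

async def compact_context(assistant: VoicePipelineAgent, chat_ctx: llm.ChatContext):
    # System messages carry the core instructions, so always keep them.
    system_msgs = [m for m in chat_ctx.messages if m.role == "system"]
    # Keep only the most recent non-empty turns to bound the token count.
    recent = [m for m in chat_ctx.messages if m.role != "system" and m.content]
    chat_ctx.messages = system_msgs + recent[-MAX_RECENT:]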
Would you like more detailed guidance on any specific optimization technique?
Sources:
- LiveKit VoicePipelineAgent Documentation