# ask-ai
**User:** I want to implement a Push-to-Talk feature with LiveKit Agents. How can I do it? Basically there is a button on the frontend, and the user clicks it to start talking. When they finish, they click the button again so the agent can start responding.
**Assistant:** Thanks for reaching out, Phuong. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
Hey there! If you're looking to add a Push-to-Talk (PTT) feature in LiveKit, here's a friendly step-by-step guide to help you out:

1. **Set up manual turn detection**: Configure the `AgentSession` with `turn_detection="manual"`. This gives you direct control over when audio input turns on and off, so you decide exactly when a user's turn starts and ends.
2. **Register RPC methods**: Use `register_rpc_method` to handle the PTT events. The key methods you'll want are:
   - `start_turn`: kicks things off by enabling audio input and interrupting any response in progress.
   - `end_turn`: when the user is done talking, disables audio input, commits their turn, and triggers a response.
   - `cancel_turn`: if needed, cancels the current turn and clears any captured input.
3. **Initialize the audio input state**: Disable audio input by default when the session starts, so nothing is recorded before the button is pressed.
4. **Implement the agent**: Build your `Agent` with components like STT, LLM, and TTS, then plug it into your session.
5. **Handle user turns**: Use callbacks to smoothly manage user audio input and handle cases where a turn might be empty.

Want to see it in action? Check out this example implementation. And if you want to dive deeper, the documentation on manual turn detection has some great info too.
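The three turn handlers above can be sketched as plain functions over the session. This is a hedged sketch, assuming the LiveKit Agents Python API (`interrupt`, `clear_user_turn`, `commit_user_turn`, and `input.set_audio_enabled` on `AgentSession`); in a real entrypoint each handler would be registered with `ctx.room.local_participant.register_rpc_method(...)` and receive the RPC invocation data:

```python
# Hedged sketch of the three PTT turn handlers; `session` is assumed to be
# an AgentSession started with turn_detection="manual".

async def start_turn(session):
    # Button pressed: cut off any in-progress reply and open the mic
    session.interrupt()
    session.clear_user_turn()
    session.input.set_audio_enabled(True)

async def end_turn(session):
    # Button pressed again: close the mic and let the agent respond
    session.input.set_audio_enabled(False)
    session.commit_user_turn()

async def cancel_turn(session):
    # Abort: close the mic and discard whatever was captured
    session.input.set_audio_enabled(False)
    session.clear_user_turn()
```

Because the handlers only touch a few session methods, the turn logic itself stays easy to exercise with a stub session object.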
**User:** If I want to use text-only mode (no STT, no TTS), how can I do that in the agent.py implementation?
**Assistant:** Hey there! If you want to set up a text-only mode for your LiveKit agent without STT (speech-to-text) or TTS (text-to-speech), here's a simple way to do it:

1. **Create an agent class**: Start by defining your agent to handle text-only interactions. Here's an example:
```python
class MyAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant."
        )
```
2. **Configure session options**: Next, use `RoomInputOptions` and `RoomOutputOptions` to turn off audio and enable text-only communication:
```python
session = AgentSession(
    llm=openai.LLM(model="gpt-4o-mini"),
    # No TTS or STT needed here
)

await session.start(
    agent=MyAgent(),
    room=ctx.room,
    room_input_options=RoomInputOptions(text_enabled=True, audio_enabled=False),
    room_output_options=RoomOutputOptions(transcription_enabled=True, audio_enabled=False),
)
```
3. **Handle text input and output**: Your agent receives text input on the `lk.chat` text-stream topic and sends replies on the `lk.transcription` topic. You can use the `generate_reply` method to respond to what users type:
```python
session.generate_reply(user_input="user's input text")
```
4. **Run the agent**: Finally, run your agent with this text-only setup. It will manage text input and output automatically, without any audio:
```python
if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
If you want to see everything in action, check out the text_only.py example in the LiveKit Agents repo. With this setup, audio is disabled and your agent will smoothly handle text inputs and responses. Hope this helps! Sources: livekit-agents/livekit/agents/voice/room_io/room_io.py | Text and transcriptions | LiveKit Docs
**User:** Can I switch between voice mode and text-only mode during the same session? For example, I want a switch on the frontend called "Stealth mode": when it's ON, no STT/TTS, and when it's OFF, the agent works normally with STT/TTS.
**Assistant:** Absolutely! You can switch between voice mode and text-only mode during the same session in LiveKit. Here's how:

1. **Programmatic toggling**: Use these methods to turn audio input and output on or off whenever you want:
   - `session.input.set_audio_enabled(True/False)` toggles the audio input.
   - `session.output.set_audio_enabled(True/False)` toggles the audio output.
2. **Frontend integration**: Add a switch in your frontend called "Stealth mode". When it's flipped on, set `audio_enabled=False` for both input and output to go into text-only mode; when it's flipped off, set `audio_enabled=True` to jump back to voice mode.
3. **Session configuration**: Start your session with both text and audio enabled, then adjust the modes on the fly using the methods above.

If you want to dive deeper, check out the LiveKit Agents documentation. Sources: Text and transcriptions | LiveKit Docs | README.md
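The stealth-mode switch boils down to one small helper. A minimal sketch, assuming the `session.input` / `session.output` toggles mentioned above (the helper name `set_stealth_mode` and its wiring to a frontend RPC are illustrative, not part of the LiveKit API):

```python
def set_stealth_mode(session, on: bool) -> None:
    # Stealth ON  -> mute both directions: no audio in (STT) or out (TTS)
    # Stealth OFF -> re-enable audio for normal voice operation
    session.input.set_audio_enabled(not on)
    session.output.set_audio_enabled(not on)

# The frontend switch could invoke this through an RPC method, e.g. one
# registered with ctx.room.local_participant.register_rpc_method(...).
```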
**User:** What if I want the agent to send some extra text to the client, but only "talk" a specific part? The use case: I want to develop a Tutor agent that generally talks, but sometimes sends an instruction as text alongside its voice content. For example:
```
User (Say): What if I want to say "Hello" to a Japanese person?
Tutor (Say): こんにちは!
Tutor (Text): You can also use more casual forms such as "おっす!" or "やあ!"
```
In this example, the Tutor sends two messages: the first one is TTS-ed, and the second one is not; it is sent to the client as plain text.
**Assistant:** Hey there! You can totally send extra text to your client without using TTS, while still using TTS for other parts, by taking advantage of LiveKit's text streams feature. Here's a simple way to do it:

1. **Use text streams directly**: Send plain text messages through the `lk.chat` text-stream topic. This skips the TTS processing and delivers the text straight to your client.
2. **Configure the AgentSession**: Make sure your session is set up to handle both text and audio. For example:
```python
session = AgentSession(llm=your_llm)

# Room I/O options are passed to session.start(), not the AgentSession constructor
await session.start(
    agent=your_agent,
    room=ctx.room,
    room_input_options=RoomInputOptions(text_enabled=True),
    room_output_options=RoomOutputOptions(audio_enabled=True, transcription_enabled=True),
)
```
3. **Send spoken and plain-text responses separately**: Use `session.say()` for the part that should be spoken aloud, and publish the extra text on the `lk.chat` text stream from the room's local participant so it bypasses TTS:

```python
# Speak this part through the TTS pipeline
await session.say("こんにちは!")

# Send the extra instruction as plain text on the lk.chat topic (no TTS)
await ctx.room.local_participant.send_text(
    'You can also use "おっす!" or "やあ!" in casual settings.',
    topic="lk.chat",
)
```
4. **Client-side handling**: On the client side, handle spoken responses (and their transcriptions) via the `lk.transcription` stream and plain-text responses via the `lk.chat` stream, like this:
```javascript
// Handle spoken responses (transcriptions)
room.registerTextStreamHandler('lk.transcription', async (reader, participantInfo) => {
    const message = await reader.readAll();
    console.log(`TTS response with transcription: ${message}`);
});

// Handle plain text responses
room.registerTextStreamHandler('lk.chat', async (reader, participantInfo) => {
    const message = await reader.readAll();
    console.log(`Plain text response: ${message}`);
});
```
This way, you get the best of both worlds—TTS for spoken content and plain text for any extra instructions or info. If you want to chat more or need a hand with anything else, just let me know!
**User:** Should I install PyTorch in my environment if I only use text and audio modes? (No vision mode.)
**Assistant:** Just a heads up: if you're using LiveKit with text and audio modes (without vision), you don't need to install PyTorch to get things running. That said, PyTorch might come in handy for some specific tasks, like converting models or handling more advanced integrations beyond the basic agent setup. If you have any questions about this, feel free to ask! Sources: requirements/requirements-convert_hf_to_gguf_update.txt | Azure OpenAI Realtime API and LiveKit | LiveKit Docs | OpenAI Realtime API integration guide | LiveKit Docs | Text and transcriptions | LiveKit Docs | livekit-agents/livekit/agents/voice/chat_cli.py
**User:** Is it possible to run the agent in console mode and use text-only? When running the console, it asks for PortAudio, which I don't have permission to install in my environment.
**Assistant:** Absolutely, you can run the LiveKit agent's console mode in text-only mode without needing PortAudio. Just configure your session to disable audio and enable text-only communication, like this:
```python
session = AgentSession(
    llm=openai.LLM(model="gpt-4o-mini"),
)

# Room I/O options are passed to session.start(), not the AgentSession constructor
await session.start(
    agent=MyAgent(),
    room=ctx.room,
    room_input_options=RoomInputOptions(
        text_enabled=True,
        audio_enabled=False,
    ),
    room_output_options=RoomOutputOptions(
        transcription_enabled=True,
        audio_enabled=False,
    ),
)
```
With this setup:
- The agent works purely in text mode, watching the `lk.chat` text stream for incoming messages.
- It sends responses as text through the `lk.transcription` stream.
- You won't need any audio dependencies like PortAudio at all.

To get the agent running in console mode, just use:
```shell
python myagent.py console
```
This is a great approach if you’re in an environment where installing audio dependencies isn’t an option. If you have any questions or need a hand with anything else, just let me know—I’m here to help! Sources: livekit-agents/livekit/agents/voice/room_io/room_io.py | Text and transcriptions | LiveKit Docs
**User:**
> send extra text to your client without using TTS, while still using TTS for other parts, by taking advantage of LiveKit's text streams feature. Here's a simple way to do it:
> 1. Use text streams directly: Just send plain text messages through the `lk.chat` text stream topic. This skips the TTS processing and delivers the text straight to your client.
> 2. Configure AgentSession: Make sure your session is set up to handle both text and audio. For example:

I still don't get how to use it. Suppose I have code like this:
```python
session = AgentSession(
    llm=openai.LLM(model="gpt-4o-mini"),
    stt=openai.STT(model="gpt-4o-transcribe", language="vi"),
    tts=openai.TTS(model="gpt-4o-mini-tts", voice="alloy"),
    turn_detection=MultilingualModel(),
    vad=ctx.proc.userdata["vad"],
    preemptive_generation=True,
)
await session.start(
    agent=Assistant(),
    room=ctx.room,
    room_input_options=RoomInputOptions(
        # LiveKit Cloud enhanced noise cancellation
        # - If self-hosting, omit this parameter
        # - For telephony applications, use `BVCTelephony` for best results
        noise_cancellation=noise_cancellation.BVC(),
    ),
)

# Join the room and connect to the user
await ctx.connect()
```
How can I tweak this logic so that, later on, I can send one part of the response through TTS and another part as plain text, the way you described above?
**Assistant:** I don't have the answer you're looking for. You could also try asking your question:
- in one of the other Slack channels, or
- at https://deepwiki.com/livekit/livekit_composite, which is trained on all LiveKit source code.

If you find the answer, please post it here to help others!
**User:** Can I use `room.local_participant.publish_transcription` to send the transcription directly as text to the client? If so, how can I call it? Can you help?