aloof-scooter-47164
09/10/2025, 1:44 AM

tall-airline-16756
09/10/2025, 1:44 AM

tall-airline-16756
09/10/2025, 1:45 AM
`BackgroundAudioPlayer` in LiveKit is designed to manage background audio, such as ambient sounds or agent-specific effects, during a session. It uses the `AudioMixer` class to blend multiple audio streams (ambient, thinking, and on-demand sounds) into one seamless output stream. The mixed audio is then sent to an `rtc.AudioSource` for publication to the LiveKit room as a separate audio track.
The `AudioMixer` mixes these streams in real time, keeping them synchronized and emitting fixed-size audio chunks. Streams can be added or removed dynamically, and each stream's level can be controlled individually. The player supports built-in audio clips, custom file paths, and async iterators that yield `rtc.AudioFrame` objects.
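For reference, here's a minimal sketch of how the player is typically wired up, modeled on examples/voice_agents/background_audio.py (the clip choices and the `ctx`/`session` objects are assumptions taken from that example):

```python
from livekit.agents import AudioConfig, BackgroundAudioPlayer, BuiltinAudioClip

# Ambient sound loops for the whole session; thinking sounds play while
# the agent is processing. Volume is applied per stream by the mixer.
background_audio = BackgroundAudioPlayer(
    ambient_sound=AudioConfig(BuiltinAudioClip.OFFICE_AMBIENCE, volume=0.8),
    thinking_sound=[
        AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8),
        AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING2, volume=0.7),
    ],
)

# Mixes the streams and publishes the result to the room as its own track.
await background_audio.start(room=ctx.room, agent_session=session)
```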
About the audio pitch differences you're experiencing: these are usually related to how the audio frames are processed or to the sample rate configuration. If you can share how you're handling the received audio frames in your Python setup, we can pinpoint whether any adjustments are needed to align with the `BackgroundAudioPlayer`'s output.
Sources: Agent speech and audio | LiveKit Docs | examples/voice_agents/background_audio.py | livekit-rtc/livekit/rtc/audio_mixer.py | Speech & Audio Processing | livekit/livekit_composite | DeepWiki

aloof-scooter-47164
09/10/2025, 1:52 AM
```python
@self.room.on("track_subscribed")
def on_track_subscribed(
    track: rtc.Track,
    publication: rtc.RemoteTrackPublication,
    participant: rtc.RemoteParticipant,
):
    logger.info(
        f"Track subscribed from {participant.identity}: {publication.sid}"
    )
    # Handle audio track subscription
    if track.kind == rtc.TrackKind.KIND_AUDIO:
        if (
            self.agent_participant
            and participant.identity == self.agent_participant.identity
        ):
            asyncio.create_task(
                self._handle_remote_audio_track(
                    cast(rtc.AudioTrack, track), participant
                ),
            )
        else:
            logger.info(
                f"Not handling audio track from {participant.identity}: "
                f"{publication.sid}"
            )

async def _handle_remote_audio_track(
    self, track: rtc.AudioTrack, participant: rtc.RemoteParticipant
):
    """Handle incoming audio track from remote participant."""
    logger.info(f"Setting up audio stream from {participant.identity}")
    try:
        # Create audio stream
        self.remote_audio_stream = rtc.AudioStream(
            track, sample_rate=self.sample_rate, num_channels=self.num_channels
        )
        # Forward to audio handler if callback is available
        if self.get_audio_handler_callback:
            audio_handler = self.get_audio_handler_callback()
            if audio_handler:
                await audio_handler.receive_audio_from_livekit(
                    self.remote_audio_stream
                )
    except Exception as e:
        logger.error(f"Error handling remote audio track: {e}")
```
///
```python
async def receive_audio_from_livekit(self, audio_stream: "rtc.AudioStream"):
    """
    Receive audio frames from LiveKit remote participant and push to playback.

    Args:
        audio_stream: LiveKit audio stream from remote participant
    """
    if not rtc:
        logger.warning("LiveKit not available for receiving audio")
        return
    self.is_receiving = True
    frames_received = 0
    try:
        logger.info("Started receiving audio from LiveKit")
        async for frame_event in audio_stream:
            if not self.is_receiving:
                break
            frames_received += 1
            # Extract audio data from frame
            audio_data = np.frombuffer(frame_event.frame.data, dtype=np.int16)
            # Push audio directly to playback if callback is set
            if self.audio_output_callback:
                try:
                    written = self.audio_output_callback(audio_data)
                    if written < len(audio_data):
                        logger.warning(
                            f"Playback buffer overflow: "
                            f"{len(audio_data) - written} samples dropped"
                        )
                except Exception as e:
                    logger.error(f"Error pushing audio to playback: {e}")
            # Apply reverse stream processing for echo cancellation;
            # the reverse stream must also be processed in chunks
            if self.apm and len(audio_data) == self.apm_frame_size:
                try:
                    render_frame = rtc.AudioFrame(
                        data=audio_data.tobytes(),
                        sample_rate=self.sample_rate,
                        num_channels=self.num_channels,
                        samples_per_channel=len(audio_data),
                    )
                    self.apm.process_reverse_stream(render_frame)
                except Exception as e:
                    if frames_received <= 10:
                        logger.warning(f"Error processing reverse stream: {e}")
            # Log progress
            if frames_received <= 5:
                logger.debug(f"Received frame {frames_received} from LiveKit")
            elif frames_received % 100 == 0:
                logger.debug(f"Received {frames_received} frames from LiveKit")
    except Exception as e:
        logger.error(f"Error receiving audio from LiveKit: {e}")
    finally:
        self.is_receiving = False
        logger.info(f"Stopped receiving audio. Total frames: {frames_received}")
```
tall-airline-16756
09/10/2025, 1:52 AM
`BackgroundAudioPlayer` uses the `AudioMixer`, which outputs audio frames with specific settings such as sample rate and number of channels.
To get playback right and avoid pitch issues, here are a few things to check:
• Make sure that the `sample_rate` and `num_channels` in your `rtc.AudioStream` match what the `BackgroundAudioPlayer` is using.
• Check whether the audio frames are being resampled or otherwise transformed before playback. If the sample rate doesn't match at playback time, the pitch shifts: 24 kHz audio played on a 48 kHz output sounds twice as fast and an octave too high. (A resampler sketch follows this list.)
• Double-check that your `audio_output_callback` is processing the audio data without changing its timing or sample rate.
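If a mismatch does turn up, a resampler in the receive path usually fixes it. A minimal sketch using `rtc.AudioResampler` (the 24 kHz input rate and the `play_frame` hook are hypothetical; substitute your actual source rate and playback call):

```python
from livekit import rtc

async def play_resampled(audio_stream: rtc.AudioStream) -> None:
    # Hypothetical: frames arrive at 24 kHz but playback runs at 48 kHz, mono.
    resampler = rtc.AudioResampler(input_rate=24000, output_rate=48000, num_channels=1)
    async for frame_event in audio_stream:
        # push() may buffer internally and return zero or more resampled frames
        for frame in resampler.push(frame_event.frame):
            play_frame(frame)  # hypothetical playback hook
    # Drain anything still buffered at the end of the stream
    for frame in resampler.flush():
        play_frame(frame)
```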
If you can share the sample rate and channel setup that your `BackgroundAudioPlayer` uses, I'd be happy to help you adjust your processing to fit perfectly. Please feel free to reach out if you want to go over this together or if there's anything else I can assist with!

aloof-scooter-47164
09/10/2025, 1:53 AM
```python
# Audio configuration
self.sample_rate = 48000
self.num_channels = 1
self.apm_frame_size = 480  # APM expects exactly 10 ms frames at 48 kHz
```
tall-airline-16756
09/10/2025, 1:53 AM
`BackgroundAudioPlayer` in LiveKit uses the same defaults as your setup: a 48000 Hz sample rate and 1 channel. This match usually rules out pitch problems caused by mismatched sample rates or channel counts.
If you're still noticing pitch issues, it's worth a closer look at how the audio frames are being handled or processed. A couple of things to double-check:
• Make sure the `audio_output_callback` plays the audio frames without changing their timing or sample rate.
• Verify that `apm.process_reverse_stream` is set up correctly and isn't unintentionally modifying the audio frames. (A wiring sketch follows this list.)
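For reference, a minimal sketch of the APM wiring, assuming the `rtc.AudioProcessingModule` wrapper from livekit-rtc (the constructor flags shown are assumptions; check your version's signature):

```python
from livekit import rtc

# APM works on 10 ms frames: 480 samples at 48 kHz, mono.
apm = rtc.AudioProcessingModule(
    echo_cancellation=True,
    noise_suppression=True,
)

render_frame = rtc.AudioFrame.create(48000, 1, 480)   # frame about to be played
capture_frame = rtc.AudioFrame.create(48000, 1, 480)  # frame from the microphone

apm.process_reverse_stream(render_frame)  # far-end / playback direction
apm.process_stream(capture_frame)         # near-end / capture direction
```

Neither call should change pitch; if bypassing `process_reverse_stream` changes what you hear, the frames feeding it are probably the wrong size or rate.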
If the problem continues, please don't hesitate to reach out. I'm here to help, and we can definitely dig deeper together to find a solution.
Sources: livekit-plugins/livekit-plugins-playai/livekit/plugins/playai/tts.py

aloof-scooter-47164
09/10/2025, 1:55 AM
• `audio_output_callback` without the `BackgroundAudioPlayer` plays agent audio correctly pitched
• `apm.process_reverse_stream` is the one from LiveKit RTC's `AudioProcessingModule`

tall-airline-16756
09/10/2025, 1:56 AM
• Audio File Format: Check that the audio files used by the `BackgroundAudioPlayer` are pre-converted to 48 kHz, mono format. Mismatched sample rates can sneakily cause pitch distortion.
• Custom Audio Processing: If you're using any custom audio processing like echo cancellation or noise suppression, verify that it's not unintentionally changing the pitch. For example, check whether the `AudioProcessingModule` might be applying any unexpected changes.
• Buffer Management: Keep an eye out for buffer underruns or overflows in the `BackgroundAudioPlayer`. Sometimes, increasing the `queue_size_ms` in the `AudioSource` configuration can help keep frame delivery smooth and stable. (See the sketch after this list.)
• Volume Scaling: The `BackgroundAudioPlayer` does apply volume scaling, which could potentially introduce artifacts if not managed carefully. It's worth ensuring this isn't impacting the audio output.
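As mentioned above, a larger source queue can smooth out jitter. A sketch of publishing through an `rtc.AudioSource` with a bigger buffer (the 1000 ms value and track name are arbitrary examples):

```python
from livekit import rtc

async def publish_with_bigger_buffer(room: rtc.Room) -> rtc.AudioSource:
    # queue_size_ms sets the AudioSource's internal buffer length
    source = rtc.AudioSource(sample_rate=48000, num_channels=1, queue_size_ms=1000)
    track = rtc.LocalAudioTrack.create_audio_track("background-audio", source)
    await room.local_participant.publish_track(track)
    return source  # frames pushed via source.capture_frame() are paced out
```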
If none of these suggestions fix the issue, could you let me know whether you're using any custom audio processing delegates or third-party pitch modification libraries in your setup? That info would really help us get to the bottom of this together.
Sources: Speech & Audio Processing | livekit/livekit_composite | DeepWiki | Sarvam TTS integration guide | LiveKit Docs | livekit-plugins/livekit-plugins-inworld/livekit/plugins/inworld/tts.py | realtime/openai-realtime-pitch-shift.py | Sources/LiveKit/Audio/Manager/AudioManager.swift

aloof-scooter-47164
09/10/2025, 2:00 AM

tall-airline-16756
09/10/2025, 2:00 AM
• Check the `sample_rate` property of your `AudioFrame` objects and see if it matches the expected rate, such as 48000 Hz for the `BackgroundAudioPlayer`.
• Try using LiveKit's handy built-in functions such as `combineAudioFrames` or `mergeFrames`; these will throw errors if there's a sample rate or channel mismatch.
• Keep an eye on your logs and any exceptions in your processing pipeline. LiveKit validates sample rate consistency and will alert you if something's off.
If you do spot any mismatches, a great way forward is to use an audio resampler to convert the audio to the right sample rate before playback or further processing. Please don't hesitate to reach out if you'd like some help implementing these checks or setting up resampling!
Sources: livekit-rtc/livekit/rtc/utils.py | agents/src/utils.test.ts | tests/test_vad.py | packages/livekit-rtc/src/audio_frame.ts

aloof-scooter-47164
09/10/2025, 2:05 AM

aloof-scooter-47164
09/10/2025, 2:05 AM
```
Received frame with length: 480, sample rate: 48000, num_channels: 1, samples_per_channel: 480
```
tall-airline-16756
09/10/2025, 2:05 AM
`combineAudioFrames` in LiveKit is a handy utility that merges multiple `rtc.AudioFrame` objects into one by concatenating their audio data. It makes sure all input frames share the same `sampleRate` and `numChannels`, and throws an error if something doesn't match up. People often use it for assembling TTS output, short audio captures, or debugging workflows.
In Python, the equivalent utility is `combine_audio_frames` in `livekit.rtc` (older code may refer to `merge_frames`). It efficiently allocates memory for the combined frame and copies data from each input frame. The final frame keeps the same sample rate and channel count, with `samples_per_channel` being the total across all input frames.
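For example, a minimal Python sketch using `rtc.combine_audio_frames` to merge a short capture into a single frame (the 100-frame cutoff is an arbitrary choice for roughly one second of 10 ms frames):

```python
from livekit import rtc

async def capture_one_second(audio_stream: rtc.AudioStream) -> rtc.AudioFrame:
    # Collect ~1 s of 10 ms frames, then merge them into a single frame.
    frames: list[rtc.AudioFrame] = []
    async for frame_event in audio_stream:
        frames.append(frame_event.frame)
        if len(frames) >= 100:
            break
    combined = rtc.combine_audio_frames(frames)
    # The merged frame keeps the shared rate and channel count;
    # samples_per_channel is the sum across inputs.
    print(combined.sample_rate, combined.num_channels, combined.samples_per_channel)
    return combined
```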
If you're dealing with audio frames and want to combine them smoothly, this function is a great tool for keeping things consistent and simplifying your process. Feel free to reach out if you'd like to dive into more details. I'm here to help!
Sources: packages/livekit-rtc/src/audio_frame.ts | livekit-rtc/livekit/rtc/audio_mixer.py | plugins/openai/src/realtime/realtime_model.ts | livekit-plugins/livekit-plugins-silero/livekit/plugins/silero/vad.py | Capturing the First Few Seconds of a Call for Processing

aloof-scooter-47164
09/10/2025, 2:35 AM

tall-airline-16756
09/10/2025, 2:35 AM

aloof-scooter-47164
09/10/2025, 2:35 AM

tall-airline-16756
09/10/2025, 2:35 AM
• Log Frame Properties: Log the details of each `AudioFrame` you receive, such as `sample_rate`, `num_channels`, and `samples_per_channel`, so we can confirm everything stays consistent across frames. (A logging sketch follows this list.)
• Test Without APM: Temporarily turning off the `AudioProcessingModule` could reveal whether it's causing any unexpected changes.
• Isolate BackgroundAudioPlayer: Try testing the `BackgroundAudioPlayer` on its own by playing a known audio file and checking that the pitch sounds right.
• Resample Input Audio: If you suspect mismatched sample rates, explicitly resampling all input audio to 48 kHz before processing or playback can often help.
• Check Volume Scaling: Also, double-check that volume adjustments within the `BackgroundAudioPlayer` aren't introducing any artifacts.
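A sketch of the frame-property logging from the first bullet; drop the loop into your receive path (`logger` stands in for your existing logger):

```python
import logging

logger = logging.getLogger(__name__)

async def log_frame_properties(audio_stream) -> None:
    async for frame_event in audio_stream:
        f = frame_event.frame
        # Any drift in these values between frames points at the pitch problem
        logger.debug(
            "frame: rate=%d channels=%d samples=%d",
            f.sample_rate,
            f.num_channels,
            f.samples_per_channel,
        )
```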
If you'd like, I'm here to walk you through any of these steps or help you analyze specific parts of your setup. Just let me know how you want to move forward!

aloof-scooter-47164
09/10/2025, 3:04 AM

tall-airline-16756
09/10/2025, 3:04 AM
`BackgroundAudioPlayer` does adjust audio streams in a few ways, which might explain the changes you're noticing:
• Audio Mixing: It uses `rtc.AudioMixer` to blend ambient, thinking, and on-demand audio streams into one output, which is then sent to the room. (A rough sketch of this stage follows the list.)
• Volume Adjustments: The player applies volume scaling to audio frames, so the signal may differ slightly before it's sent out.
• Buffering: The `rtc.AudioSource` includes a 400 ms buffer, which can sometimes cause slight delays or timing shifts.
• Agent State Synchronization: It can play sounds like typing noises based on the agent's state, which dynamically changes the audio stream.
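For the curious, a rough sketch of what that mixing stage looks like. The `rtc.AudioMixer` constructor shape and iteration protocol here are assumptions based on livekit-rtc's audio_mixer.py; treat this as illustrative, not exact:

```python
from livekit import rtc

async def mix_to_source(streams, source: rtc.AudioSource) -> None:
    # streams: async iterators yielding rtc.AudioFrame (assumed interface)
    mixer = rtc.AudioMixer(sample_rate=48000, num_channels=1)
    for s in streams:
        mixer.add_stream(s)
    async for frame in mixer:  # mixed, fixed-size output chunks
        await source.capture_frame(frame)
```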
All these features aim to improve your audio experience, though they might sometimes affect the stream in unexpected ways. I'd be happy to help you test specific settings or dig deeper to pinpoint the issue. Just let me know how you'd like to move forward!
Sources: examples/drive-thru/drivethru_agent.py | Agent speech and audio | LiveKit Docs | Speech & Audio Processing | livekit/livekit_composite | DeepWiki

aloof-scooter-47164
09/10/2025, 3:10 AM

tall-airline-16756
09/10/2025, 3:10 AM
• Audio Mixing: The `rtc.AudioMixer` brings together multiple audio streams, and any mismatch in sample rates or timing between them could impact the final sound.
• Volume Adjustments: The `BackgroundAudioPlayer` adjusts volume on audio frames, and sometimes that can introduce subtle changes if not handled perfectly.
• Buffering: The 400 ms buffer in `rtc.AudioSource` might cause timing shifts, especially if there are delays during processing or frame delivery.
To help pinpoint what's going on, I suggest trying these steps (a sketch of step 3 follows below):
1. Turn off the ambient and thinking sounds in the `BackgroundAudioPlayer` to see if the pitch issue still happens.
2. Log the audio frames sent by the `BackgroundAudioPlayer` to check for any differences in sample rate, channels, or timing.
3. Try a simpler setup, like playing just one audio file, to see whether the problem is related to mixing or playback.
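A sketch of step 3, isolating the player with a single known file (the file path is a placeholder, and `ctx`/`session` are the same assumed objects as in the earlier start() sketch):

```python
from livekit.agents import BackgroundAudioPlayer

async def pitch_check(ctx, session) -> None:
    # No ambient or thinking sounds: anything you hear comes from play() alone.
    background_audio = BackgroundAudioPlayer()
    await background_audio.start(room=ctx.room, agent_session=session)

    # Play one clip you know well; if the pitch is off here, the problem is
    # in playback rather than in the mixing of multiple streams.
    background_audio.play("/path/to/known-clip.ogg", loop=False)
```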
If you'd like, I'm happy to guide you through any of these tests or help with anything else. Just let me know!

aloof-scooter-47164
09/10/2025, 8:34 PM

tall-airline-16756
09/10/2025, 8:35 PM

aloof-scooter-47164
09/10/2025, 8:50 PM