# ask-ai
a
how does the background player work under the hood? i see there's an audio mixer class it uses. i want to know because when using the background audio player, the audio in my local python sounds pitched down, but in the Room recording it plays correctly, so something changes in how i have to process the received audio frames
t
Thanks for reaching out, Sebastián. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.
The `BackgroundAudioPlayer` in LiveKit is designed to manage background audio, like ambient sounds or agent-specific effects, during a session. It uses the `AudioMixer` class to blend multiple audio streams (such as ambient, thinking, and on-demand sounds) into one seamless output stream. This mixed audio is then sent to an `rtc.AudioSource` for publication to the LiveKit room as a separate audio track. The `AudioMixer` takes care of mixing these streams in real time, keeping them synchronized and output in fixed-size audio chunks. You can add or remove each stream dynamically, and control audio levels individually. The player supports built-in audio clips, custom file paths, and async iterators that yield `rtc.AudioFrame` objects.

About the audio pitch differences you're experiencing: this might be related to how the audio frames are processed or the sample rate configuration. If you could share a bit more about how you're handling the received audio frames in your Python setup, that would be really helpful. Together, we can pinpoint if any adjustments are needed to better align with the `BackgroundAudioPlayer`'s output.

Sources: Agent speech and audio | LiveKit Docs | examples/voice_agents/background_audio.py | livekit-rtc/livekit/rtc/audio_mixer.py | Speech & Audio Processing | livekit/livekit_composite | DeepWiki
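For orientation, a minimal usage sketch modeled on the examples/voice_agents/background_audio.py example cited above; `ctx.room` and `session` are assumed to come from a standard agent entrypoint:

```python
from livekit.agents import AudioConfig, BackgroundAudioPlayer, BuiltinAudioClip

# Sketch: ambient office noise plus typing sounds while the agent "thinks".
# The player mixes these internally and publishes them as one extra track.
background_audio = BackgroundAudioPlayer(
    ambient_sound=AudioConfig(BuiltinAudioClip.OFFICE_AMBIENCE, volume=0.8),
    thinking_sound=[
        AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8),
        AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING2, volume=0.7),
    ],
)
await background_audio.start(room=ctx.room, agent_session=session)
```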
a
```python
        @self.room.on("track_subscribed")
        def on_track_subscribed(
            track: rtc.Track,
            publication: rtc.RemoteTrackPublication,
            participant: rtc.RemoteParticipant,
        ):
            logger.info(
                f"Track subscribed from {participant.identity}: {publication.sid}"
            )

            # Handle audio track subscription
            if track.kind == rtc.TrackKind.KIND_AUDIO:
                if (
                    self.agent_participant
                    and participant.identity == self.agent_participant.identity
                ):
                    asyncio.create_task(
                        self._handle_remote_audio_track(
                            cast(rtc.AudioTrack, track), participant
                        ),
                    )
                else:
                    logger.info(
                        f"Not handling audio track from {participant.identity}: "
                        f"{publication.sid}"
                    )

    async def _handle_remote_audio_track(
        self, track: rtc.AudioTrack, participant: rtc.RemoteParticipant
    ):
        """Handle incoming audio track from remote participant."""
        logger.info(f"Setting up audio stream from {participant.identity}")

        try:
            # Create audio stream
            self.remote_audio_stream = rtc.AudioStream(
                track, sample_rate=self.sample_rate, num_channels=self.num_channels
            )

            # Forward to audio handler if callback is available
            if self.get_audio_handler_callback:
                audio_handler = self.get_audio_handler_callback()
                if audio_handler:
                    await audio_handler.receive_audio_from_livekit(
                        self.remote_audio_stream
                    )

        except Exception as e:
            logger.error(f"Error handling remote audio track: {e}")

```

///

```python

    async def receive_audio_from_livekit(self, audio_stream: "rtc.AudioStream"):
        """
        Receive audio frames from LiveKit remote participant and push to playback.

        Args:
            audio_stream: LiveKit audio stream from remote participant
        """
        if not rtc:
            logger.warning("LiveKit not available for receiving audio")
            return

        self.is_receiving = True
        frames_received = 0

        try:
            logger.info("Started receiving audio from LiveKit")

            async for frame_event in audio_stream:
                if not self.is_receiving:
                    break

                frames_received += 1

                # Extract audio data from frame
                audio_data = np.frombuffer(frame_event.frame.data, dtype=np.int16)

                # Push audio directly to playback if callback is set
                if self.audio_output_callback:
                    try:
                        written = self.audio_output_callback(audio_data)
                        if written < len(audio_data):
                            logger.warning(
                                f"Playback buffer overflow: "
                                f"{len(audio_data) - written} samples dropped"
                            )
                    except Exception as e:
                        logger.error(f"Error pushing audio to playback: {e}")

                # Apply reverse stream processing for echo cancellation
                # For reverse stream, we need to process it in chunks too
                if self.apm and len(audio_data) == self.apm_frame_size:
                    try:
                        render_frame = rtc.AudioFrame(
                            data=audio_data.tobytes(),
                            sample_rate=self.sample_rate,
                            num_channels=self.num_channels,
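                            # valid only because num_channels == 1; for multi-channel
                            # audio this should be len(audio_data) // num_channels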
                            samples_per_channel=len(audio_data),
                        )
                        self.apm.process_reverse_stream(render_frame)
                    except Exception as e:
                        if frames_received <= 10:
                            logger.warning(f"Error processing reverse stream: {e}")

                # Log progress
                if frames_received <= 5:
                    logger.debug(f"Received frame {frames_received} from LiveKit")
                elif frames_received % 100 == 0:
                    logger.debug(f"Received {frames_received} frames from LiveKit")

        except Exception as e:
            logger.error(f"Error receiving audio from LiveKit: {e}")
        finally:
            self.is_receiving = False
            logger.info(f"Stopped receiving audio. Total frames: {frames_received}")
```
t
Thank you so much for sharing your code with me. From what I see, the audio pitch issue might be happening because of mismatched sample rates or how the audio frames are handled during playback or processing. The `BackgroundAudioPlayer` uses the `AudioMixer`, which outputs audio frames with specific settings like sample rate and number of channels. To get the playback right and avoid pitch issues, here are a few things to check:
• Make sure that the `sample_rate` and `num_channels` in your `rtc.AudioStream` match what the `BackgroundAudioPlayer` is using (a quick way to verify this is sketched below).
• Check whether the audio frames are being resampled or changed before playback. If the sample rate doesn't match during playback, it can cause the pitch to shift.
• Double-check that your `audio_output_callback` is processing the audio data without changing its timing or sample rate.

If you can share the sample rate and channel setup that your `BackgroundAudioPlayer` uses, I'd be happy to help you adjust your processing to fit. Please feel free to reach out if you want to go over this together or if there's anything else I can assist with!
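A minimal way to verify the first bullet (an editor's sketch, not from the original thread): iterate the same `rtc.AudioStream` created in `_handle_remote_audio_track` above and log each frame's properties before any other processing touches them:

```python
# Sketch: log the first few frames to confirm the negotiated format
# actually matches the expected 48 kHz mono configuration.
count = 0
async for frame_event in audio_stream:
    f = frame_event.frame
    logger.debug(
        "frame %d: sample_rate=%d num_channels=%d samples_per_channel=%d",
        count, f.sample_rate, f.num_channels, f.samples_per_channel,
    )
    count += 1
    if count >= 5:
        break
```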
a
oh forgot, this is the audio configuration:
```python
        # Audio configuration
        self.sample_rate = 48000
        self.num_channels = 1
        self.apm_frame_size = 480  # APM expects exactly 10ms at 48kHz
```
t
Thank you so much for sharing your audio configuration! The `BackgroundAudioPlayer` in LiveKit uses the same default sample rate of 48000 Hz and 1 channel as your setup, which usually rules out pitch problems that come from mismatched sample rates or channel counts. If you're still noticing pitch issues, it might be worth a closer look at how the audio frames are being handled or processed. Here are a couple of things to double-check:
• Make sure the `audio_output_callback` plays the audio frames without changing their timing or sample rate.
• Verify that `apm.process_reverse_stream` is set up correctly and isn't unintentionally modifying the audio frames.

If the problem continues, please don't hesitate to reach out. I'm here to help, and we can definitely dig deeper together to find a solution.

Sources: livekit-plugins/livekit-plugins-playai/livekit/plugins/playai/tts.py
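For context on the second bullet, a minimal sketch of constructing LiveKit's `AudioProcessingModule` (keyword names here reflect my understanding of the Python SDK and may differ between versions). Note that `process_reverse_stream` expects exactly 10 ms of audio per call, which is why the code above only feeds it frames where `len(audio_data) == apm_frame_size`:

```python
from livekit import rtc

# Sketch: APM with echo cancellation; frames pushed through
# process_reverse_stream() act as the far-end reference signal.
apm = rtc.AudioProcessingModule(
    echo_cancellation=True,
    noise_suppression=True,
    high_pass_filter=True,
    auto_gain_control=True,
)
```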
a
• `audio_output_callback` without the BackgroundAudioPlayer plays agent audio correctly pitched
• `apm.process_reverse_stream` is the one from LiveKit RTC's AudioProcessingModule
t
Thank you so much for confirming the details! Based on what you've shared and your setup, here are some possible causes and troubleshooting steps for the pitch issue:
• Sample Rate Mismatch: Double-check that all audio files or streams used with the `BackgroundAudioPlayer` are pre-converted to 48 kHz, mono format. Mismatched sample rates can cause pitch distortion.
• Custom Audio Processing: If you're using any custom audio processing like echo cancellation or noise suppression, verify that it's not unintentionally changing the pitch. For example, check if the `AudioProcessingModule` might be applying any unexpected changes.
• Buffer Management: Keep an eye out for buffer underruns or overflows in the `BackgroundAudioPlayer`. Increasing the `queue_size_ms` in the `AudioSource` configuration can help keep frame delivery smooth and stable (see the sketch after this list).
• Volume Scaling: The `BackgroundAudioPlayer` applies volume scaling, which could potentially introduce artifacts if not managed carefully. It's worth ensuring this isn't impacting the audio output.

If none of these suggestions fix the issue, could you please let me know if you're using any custom audio processing delegates or third-party pitch modification libraries in your setup? That info would really help us get to the bottom of this together.

Sources: Speech & Audio Processing | livekit/livekit_composite | DeepWiki | Sarvam TTS integration guide | LiveKit Docs | livekit-plugins/livekit-plugins-inworld/livekit/plugins/inworld/tts.py | realtime/openai-realtime-pitch-shift.py | Sources/LiveKit/Audio/Manager/AudioManager.swift
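A sketch of the buffering suggestion (the `queue_size_ms` value here is illustrative, not a recommendation from the thread):

```python
from livekit import rtc

# Sketch: a larger internal queue smooths out bursty frame delivery
# before the track is published to the room.
source = rtc.AudioSource(sample_rate=48000, num_channels=1, queue_size_ms=1000)
track = rtc.LocalAudioTrack.create_audio_track("background-audio", source)
```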
a
how to check the SampleRateMismatch?
t
To help you check for sample rate mismatches in LiveKit audio streams, here are a few tips:
• Take a look at the `sample_rate` property of your `AudioFrame` objects and see if it matches the expected rate, like 48000 Hz for the `BackgroundAudioPlayer`.
• Try using LiveKit's built-in functions such as `combineAudioFrames` or `mergeFrames`. These throw errors if there's a mismatch between frames.
• Keep an eye on your logs or any exceptions in your processing pipeline; LiveKit validates sample rate consistency and will alert you if something's off.

If you do spot any mismatches, use an audio resampler to convert the audio to the right sample rate before playback or further processing (a resampler sketch follows below). And please don't hesitate to reach out if you'd like help implementing these checks or setting up resampling!

Sources: livekit-rtc/livekit/rtc/utils.py | agents/src/utils.test.ts | tests/test_vad.py | packages/livekit-rtc/src/audio_frame.ts
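A minimal resampling sketch using the Python SDK's `rtc.AudioResampler` (the 24 kHz input rate is just an example of a mismatched source):

```python
from livekit import rtc

# Sketch: convert incoming frames to 48 kHz mono before playback.
resampler = rtc.AudioResampler(input_rate=24000, output_rate=48000, num_channels=1)

def resample(frame: rtc.AudioFrame) -> list[rtc.AudioFrame]:
    # push() buffers internally and returns zero or more output frames
    return resampler.push(frame)

# when the input stream ends, drain whatever the resampler still holds
remaining = resampler.flush()
```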
a
what is the combineAudioFrames?
Received frame with length: 480, sample rate: 48000, num_channels: 1, samples_per_channel: 480
t
I'm happy to help you understand the `combineAudioFrames` function in LiveKit! This utility merges multiple `rtc.AudioFrame` objects into one by concatenating their audio data. It makes sure all input frames share the same `sampleRate` and `numChannels`, and raises an error if something doesn't match up. People often use it for assembling TTS output, short audio captures, or debugging workflows. When working in Python, you might see this function referred to as `merge_frames` or `combine_frames`. It allocates memory for the combined frame once and copies data from each input frame. The final frame keeps the same `sampleRate` and `numChannels`, with `samplesPerChannel` being the total across all input frames. If you're combining audio frames, this function keeps things consistent and simplifies the process (a short usage sketch follows below). Feel free to reach out if you'd like more details!

Sources: packages/livekit-rtc/src/audio_frame.ts | livekit-rtc/livekit/rtc/audio_mixer.py | plugins/openai/src/realtime/realtime_model.ts | livekit-plugins/livekit-plugins-silero/livekit/plugins/silero/vad.py | Capturing the First Few Seconds of a Call for Processing
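A small usage sketch; in the Python SDK the utility lives in livekit-rtc/livekit/rtc/utils.py (cited above) as `rtc.combine_audio_frames`, and `collect_some_frames` below is a hypothetical helper:

```python
from livekit import rtc

# Sketch: concatenate frames of identical format into a single frame;
# the call raises an error if sample rates or channel counts differ.
frames: list[rtc.AudioFrame] = collect_some_frames()  # hypothetical
combined = rtc.combine_audio_frames(frames)
print(combined.sample_rate, combined.num_channels, combined.samples_per_channel)
```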
a
i'm testing different things and i can't seem to be hitting the issue
t
I'm really glad to hear that the issue hasn’t come up again during your tests. If the pitch problem does pop up once more, please don’t hesitate to share any new details or settings with me—I’m here to help you every step of the way. And of course, if there’s anything else you’d like to discuss or need help with, just let me know!
a
no i meant, i can't find the root of the issue
t
I completely understand how tricky this pitch issue can be when the root cause isn't immediately clear. Let's try narrowing it down with a few focused steps:
• Log Audio Frame Details: Add detailed logging for each `AudioFrame` you receive, including `sample_rate`, `num_channels`, and `samples_per_channel`, to confirm everything stays consistent across frames.
• Test Without APM: Temporarily turning off the `AudioProcessingModule` could reveal if it's causing any unexpected changes.
• Isolate BackgroundAudioPlayer: Test the `BackgroundAudioPlayer` on its own by playing a known audio file and checking whether the pitch sounds right.
• Resample Input Audio: If you think sample rates might be mismatched, explicitly resampling all input audio to 48 kHz before processing or playback can help.
• Check Volume Scaling: Double-check that volume adjustments within the `BackgroundAudioPlayer` aren't introducing artifacts.

If you'd like, I'm here to walk you through any of these steps or help you analyze specific parts of your setup. Just let me know how you want to move forward!
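One complementary check not in the list above (an editor's sketch): a stream that sounds pitched down locally while the Room recording is fine usually means the local playback path consumes the 48 kHz samples at a lower device rate. Comparing received audio-time against wall-clock time helps rule the network side in or out:

```python
import time

# Sketch: if frames arrive in real time (no drift) but playback still sounds
# slow and low, the mismatch is in the local output device's sample rate.
start = time.monotonic()
samples = 0
async for frame_event in audio_stream:
    frame = frame_event.frame
    samples += frame.samples_per_channel
    audio_seconds = samples / frame.sample_rate
    wall_seconds = time.monotonic() - start
    if abs(wall_seconds - audio_seconds) > 0.5:
        logger.warning(
            "drift: %.2fs of audio received in %.2fs of wall time",
            audio_seconds,
            wall_seconds,
        )
```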
a
definitely there's something in the BackgroundAudioPlayer that is changing something in the way LK sends the audio stream, because if i disable it on the room, everything sounds correct
t
Thanks so much for sharing your observations! The `BackgroundAudioPlayer` does adjust audio streams in a few ways, which might explain the changes you're noticing:
• Audio Mixing: It uses `rtc.AudioMixer` to blend ambient, thinking, and on-demand audio streams into one output, which is then sent to the room (a minimal mixing sketch follows below).
• Volume Adjustments: The player applies volume scaling to audio frames, so the audio signal might be slightly different before it's sent out.
• Buffering: The `rtc.AudioSource` includes a 400 ms buffer, which can sometimes cause slight delays or timing shifts.
• Agent State Synchronization: It can play sounds like typing noises based on the agent's state, which dynamically changes the audio stream.

All these features aim to improve the audio experience, though they might sometimes affect the stream in unexpected ways. I'd be happy to help you test specific settings or dig deeper to pinpoint the issue. Just let me know how you'd like to move forward!

Sources: examples/drive-thru/drivethru_agent.py | Agent speech and audio | LiveKit Docs | Speech & Audio Processing | livekit/livekit_composite | DeepWiki
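For reference, a rough sketch of the mixing loop described in the first bullet, based on the livekit-rtc/livekit/rtc/audio_mixer.py file cited above (constructor and method names are from memory and may differ between SDK versions; `ambient_frames` and `thinking_frames` are hypothetical async iterators of `rtc.AudioFrame`):

```python
from livekit import rtc

# Sketch: mix two frame streams into one 48 kHz mono output and
# forward the mixed frames to the published audio source.
mixer = rtc.AudioMixer(sample_rate=48000, num_channels=1)
mixer.add_stream(ambient_frames)
mixer.add_stream(thinking_frames)

async for frame in mixer:  # yields fixed-size mixed frames
    await audio_source.capture_frame(frame)
```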
a
which of these could be the potential issue?
t
Thank you for sharing your observations! The pitch issue might be coming from a few possible areas:
• Audio Mixing: The `rtc.AudioMixer` brings together multiple audio streams, and if there's any mismatch in sample rates or timing between them, it could impact the final sound.
• Volume Adjustments: The `BackgroundAudioPlayer` adjusts volume on audio frames, and sometimes that can introduce subtle changes if not perfectly handled.
• Buffering: The 400 ms buffer in `rtc.AudioSource` might cause timing shifts, especially if there are delays during processing or frame delivery.

To help pinpoint what's going on, I suggest trying these steps:
1. Turn off the ambient and thinking sounds in the `BackgroundAudioPlayer` to see if the pitch issue still happens (see the sketch after this list).
2. Log the audio frames sent by the `BackgroundAudioPlayer` to check for any differences in sample rate, channels, or timing.
3. Try a simpler setup, like using just one audio file, to see if the problem is related to mixing or playback.

If you'd like, I'm happy to guide you through any of these tests or help with anything else. Just let me know!
a
hey, is it possible that thinking phrases from the agent are considered as thinking state rather than speaking?
t
I don't have the answer you're looking for. You could also try asking your question:
• in one of the other Slack channels, or
• at https://deepwiki.com/livekit/livekit_composite, which is trained on all LiveKit source code

If you find the answer, please post it here to help others!
a
can the background player's sounds be WAV files? is there a suggested approach for local files? if the sound is 3s, what's the suggested playout timeout?