Hello! I have a working real-time transcription setup using TrackEgress. Here is my current flow:
A track_published webhook triggers a TrackEgress process.
Egress streams the raw PCM audio (48kHz, stereo) over a WebSocket to my Node.js server.
My Node.js server uses FFmpeg to downsample the audio to 16kHz mono.
The processed audio is then streamed to the Google Speech-to-Text API.
This works perfectly, but it creates a separate Egress and transcription session for every participant, which is resource-intensive.
I want to switch to a more efficient model using RoomCompositeEgress to get a single, mixed audio stream for the entire room. My goal is still real-time transcription.
I have the following questions about the best way to architect this:
Real-time Output: What is the recommended output for RoomCompositeEgress to achieve real-time, low-latency audio streaming? The documentation primarily shows file outputs (.mp4, .ogg). Can it stream to a WebSocket like TrackEgress?
RTMP Streaming: If WebSocket is not an option, can RoomCompositeEgress stream the mixed audio to an RTMP endpoint?
Node.js Integration: If RTMP is the recommended way, what is the best practice for receiving this RTMP stream in a Node.js application? I am considering using a library like node-media-server. Is this a good approach?
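This is the kind of setup I have in mind, untested; the config keys follow node-media-server's README, while the port and stream path are my own placeholders:

```javascript
// Tentative sketch: ingest the mixed RTMP stream with node-media-server.
const nmsConfig = {
  rtmp: {
    port: 1935,        // where RoomCompositeEgress would publish to
    chunk_size: 60000,
    gop_cache: false,  // audio-only stream, no video GOPs worth caching
    ping: 30,
    ping_timeout: 60,
  },
};

// const NodeMediaServer = require('node-media-server');
// const nms = new NodeMediaServer(nmsConfig);
// nms.on('postPublish', (id, streamPath) => {
//   // Egress connected; streamPath would be e.g. '/live/room-audio'.
//   // Hand rtmp://localhost:1935 + streamPath to FFmpeg from here.
// });
// nms.run();

console.log(JSON.stringify(nmsConfig.rtmp));
```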
Audio Processing: Once my Node.js server receives the RTMP stream, what is the best way to extract the raw audio from it and pipe it into my FFmpeg process for transcoding (downsampling to 16kHz mono for Google STT)?
Audio Format: What is the exact audio format (codec, sample rate, channels) of the mixed audio that RoomCompositeEgress sends over the RTMP stream?
Could you please provide guidance and, if possible, a small code example in Node.js for setting up the RoomCompositeEgress with an RTMP output and then consuming that stream with FFmpeg?
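Here is what I have pieced together so far for the egress request. I have not verified the field names against the SDK, the room name and ingest URL are placeholders, and I would appreciate corrections:

```javascript
// Hypothetical audio-only RoomCompositeEgress request pointed at my RTMP ingest.
// Field names are my best guess and may not match the actual SDK/proto.
const egressRequest = {
  roomName: 'my-room',
  audioOnly: true, // no video compositing needed for transcription
  stream: {
    protocol: 'rtmp',
    urls: ['rtmp://my-node-server:1935/live/room-audio'], // placeholder ingest URL
  },
};

// With livekit-server-sdk I would expect something along these lines:
// const egress = new EgressClient(LIVEKIT_URL, API_KEY, API_SECRET);
// await egress.startRoomCompositeEgress(egressRequest.roomName, /* stream output */, /* audio-only opts */);

console.log(JSON.stringify(egressRequest));
```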
Thank you!