elegant-coat-98622
07/09/2025, 12:36 PM
I'm running the egress service with max_cpu: 1000 and the default cpu_cost table (roomComposite 600, track 200).
My backend fires StartRoomCompositeEgress the moment it receives the room_started webhook.
Under load I see livekit_node_twirp_request_status{method="StartRoomCompositeEgress",code="unavailable"} spike and the SDK throws "twirp error unknown: no response from servers".
How can I guarantee the first room to request recording always gets the next available egress worker, and what Prometheus or Redis pattern do you recommend to prevent duplicate start attempts? Include code-level pointers (package/file) if relevant.
tall-airline-16756
07/09/2025, 12:36 PM
tall-airline-16756
07/09/2025, 12:37 PM
Duplicate start attempts for the same egress are rejected by the service with an AlreadyExists error (returning ErrEgressAlreadyExists).
Prometheus metrics like livekit_egress_available help track resource availability, but there's no built-in way to guarantee the first request always gets the next available egress worker: scheduling is best-effort and based on current resource usage, so proper CPU cost tuning and setting max_cpu accordingly is critical.
Regarding duplicate prevention, the service already enforces uniqueness by egress ID internally. Using Redis or extra Prometheus patterns for deduplication is not documented as a recommended practice within LiveKit.
Relevant code can be found in the egress service handling request admission control and resource tracking (notably in packages related to Monitor and CPUCostConfig).
If you want a stronger guarantee of first-request priority, consider integrating your backend with the LiveKit API to track and queue requests before calling StartRoomCompositeEgress, using external coordination.
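As a rough sketch of that external coordination (not a LiveKit-provided API), one way to prevent duplicate start attempts from your backend is to take a short-lived Redis claim per room before calling StartRoomCompositeEgress. This assumes the Go server SDK and go-redis; the key name, TTL, layout, and output settings are placeholders for your setup:

```go
package recorder

import (
	"context"
	"fmt"
	"time"

	"github.com/livekit/protocol/livekit"
	lksdk "github.com/livekit/server-sdk-go/v2"
	"github.com/redis/go-redis/v9"
)

// StartRecordingOnce issues StartRoomCompositeEgress only if no other backend
// instance has already claimed this room, using SETNX as a distributed guard.
func StartRecordingOnce(ctx context.Context, rdb *redis.Client, ec *lksdk.EgressClient, room string) error {
	// Claim the room; only the first caller wins, and the claim expires on its own.
	claimed, err := rdb.SetNX(ctx, "egress:claim:"+room, "1", 10*time.Minute).Result()
	if err != nil {
		return err
	}
	if !claimed {
		return nil // another instance already started (or is starting) this egress
	}

	_, err = ec.StartRoomCompositeEgress(ctx, &livekit.RoomCompositeEgressRequest{
		RoomName: room,
		Layout:   "speaker",
		// Output settings omitted; add your file/stream/segment outputs here.
	})
	if err != nil {
		rdb.Del(ctx, "egress:claim:"+room) // release the claim so a retry can start it
		return fmt.Errorf("start egress for room %s: %w", room, err)
	}
	return nil
}
```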
Sources: README.md | Self-hosting the Egress Service | LiveKit Docs
elegant-coat-98622
07/09/2025, 12:43 PM
tall-airline-16756
07/09/2025, 12:45 PM
LiveKit exposes a Prometheus metric, livekit_egress_available, that helps with autoscaling by showing how much CPU each egress instance has available. Behind the scenes, egress requests are queued over Redis Pub/Sub and picked up by egress pods based on their current workload, so the system can scale from zero to multiple pods. The catch is that there isn't an external API where you can directly peek into the egress resource queue or manage a first-in-first-out (FIFO) queue. You can use the ListEgress API endpoint to get a snapshot of active egress instances, but it's more for monitoring than precise scheduling.

If you need to guarantee your requests are processed in order, you'll want a custom solution: think of it as a smart waiting list that checks CPU availability before sending each request. You could use Redis to track your requests and periodically check the livekit_egress_available metric before calling StartRoomCompositeEgress. Since LiveKit doesn't offer an out-of-the-box solution for this orchestration, rolling your own backend queue that watches egress CPU availability is your best bet. Here's a quick pseudocode example to give you an idea:

```go
// Pseudocode
// 1. Push incoming room recording requests into a Redis FIFO queue
// 2. Periodically poll the Prometheus metric `livekit_egress_available`
// 3. If available slots > 0, pop a request from the Redis queue and call StartRoomCompositeEgress
```
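And here's a minimal sketch of such a queue worker under those assumptions: go-redis backs the FIFO list, and the availability check is passed in as a function (one way to implement it with Prometheus is shown further down). None of these names are LiveKit APIs; they're your own orchestration layer:

```go
package recorder

import (
	"context"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

const queueKey = "egress:pending" // Redis list used as a FIFO queue (LPUSH head, BRPOP tail)

// Enqueue is called from the room_started webhook handler instead of starting
// egress directly, so requests are serialized in arrival order.
func Enqueue(ctx context.Context, rdb *redis.Client, room string) error {
	return rdb.LPush(ctx, queueKey, room).Err()
}

// RunWorker pops rooms off the queue one at a time and only starts egress when
// the cluster reports spare capacity, preserving FIFO order across requests.
func RunWorker(ctx context.Context, rdb *redis.Client,
	availableSlots func(context.Context) float64,
	start func(context.Context, string) error) {
	for {
		// Wait until at least one egress slot looks free.
		if availableSlots(ctx) <= 0 {
			time.Sleep(2 * time.Second)
			continue
		}

		vals, err := rdb.BRPop(ctx, 5*time.Second, queueKey).Result()
		if errors.Is(err, redis.Nil) {
			continue // queue empty within the timeout, keep polling
		}
		if err != nil {
			time.Sleep(time.Second) // transient Redis error, back off briefly
			continue
		}

		room := vals[1] // BRPOP returns [key, value]
		if err := start(ctx, room); err != nil {
			// Put the room back at the tail so it's retried next; use LPush
			// instead if you'd rather retry it after other pending rooms.
			rdb.RPush(ctx, queueKey, room)
		}
	}
}
```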
If you dig into the LiveKit codebase, you'll find the interesting bits around the Redis queues for egress requests, how the Prometheus metrics are integrated, and the ListEgress API. Hope this helps break things down! Let me know if you have any questions.
Sources: Self-hosting the Egress Service | LiveKit Docs | Universal Egress | Egress API | LiveKit Docs
elegant-coat-98622
07/09/2025, 12:47 PM
livekit_egress_available will be null until we actually have an egress pod running, since the metric wouldn't be scraped?
Great, so basically I need to serialize requests to avoid any race conditions, i.e. for the FIFO thing.
tall-airline-16756
07/09/2025, 12:47 PM
You're exactly right about the livekit_egress_available metric: it won't exist (the query comes back empty) until at least one egress pod is running and being scraped. What this means for you is that your backend needs to handle serialization and queuing of requests itself, which avoids those tricky race conditions, especially when you're scaling up from zero. Think of it like making sure everyone gets served in the right order at a coffee shop! Your best bet is to have your backend keep an eye on scaling events, or even proactively spin up pods, before sending out egress start requests, and then do a quick check on livekit_egress_available once the pods are up and running. That way everything is processed in order and you're using your resources efficiently.
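For that check, here's one way the availability lookup from the earlier sketch could be backed by the Prometheus Go client (wrap it to drop the error if you pass it straight into the worker). An empty query result, meaning no egress pods have been scraped yet, is reported as zero capacity; the Prometheus address and query expression are assumptions for your setup:

```go
package recorder

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// AvailableSlots sums livekit_egress_available across all scraped egress pods.
// When no pods are running yet, the metric is absent and the query returns an
// empty vector, which we report as 0 so callers wait (or scale up) instead of
// firing StartRoomCompositeEgress into the void.
func AvailableSlots(ctx context.Context, promAddr string) (float64, error) {
	client, err := api.NewClient(api.Config{Address: promAddr}) // e.g. "http://prometheus:9090"
	if err != nil {
		return 0, err
	}
	v1api := promv1.NewAPI(client)

	result, _, err := v1api.Query(ctx, "sum(livekit_egress_available)", time.Now())
	if err != nil {
		return 0, err
	}

	vec, ok := result.(model.Vector)
	if !ok || len(vec) == 0 {
		return 0, nil // metric not present: no egress pods have been scraped yet
	}
	return float64(vec[0].Value), nil
}
```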
elegant-coat-98622
07/09/2025, 12:57 PM
Instead of livekit_egress_available, I can just rely on the not_found or unavailable twirp errors to decide when to manually increase the replica count of the egress pods, correct? (Of course while serializing the acquire step.)
tall-airline-16756
07/09/2025, 12:58 PM
livekit_egress_available can be super helpful for making autoscaling decisions. Just be careful: using Twirp errors like not_found or unavailable as direct scaling triggers can introduce tricky timing issues. My recommendation? Focus on proactive signals like CPU usage and the livekit_egress_available gauge. Combining those with a smart backend request-queuing strategy will give you the most reliable scaling approach; think of it as a traffic controller managing your resources. If you really want a backup plan, you could add custom backend logic to manually trigger scale-ups, just keep in mind this would be your own special sauce, not something built directly into LiveKit.
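If you do want to look at those errors, say to requeue the request and raise an alert rather than scale directly, the Twirp error code can be inspected like this. This is a sketch that assumes the error surfaces as a twirp.Error from the generated client (github.com/twitchtv/twirp); the return labels are placeholders for your own handling:

```go
package recorder

import (
	"errors"

	"github.com/twitchtv/twirp"
)

// classifyStartError decides what to do with a failed StartRoomCompositeEgress
// call. These outcomes are "requeue and alert" signals, not scaling triggers.
func classifyStartError(err error) string {
	var twErr twirp.Error
	if !errors.As(err, &twErr) {
		return "retry" // transport-level failure (e.g. "no response from servers"): retry with backoff
	}

	switch twErr.Code() {
	case twirp.Unavailable, twirp.ResourceExhausted:
		// No egress worker could take the job right now: put the request back
		// on the queue and emit a metric/alert that capacity may need to grow.
		return "requeue"
	case twirp.NotFound:
		// Room no longer exists: drop the request.
		return "drop"
	case twirp.AlreadyExists:
		// Another start attempt won the race: nothing to do.
		return "done"
	default:
		return "retry"
	}
}
```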
Hope this helps make things a bit clearer! Feel free to reach out if you need more guidance.
Sources: Self-hosting the Egress Service | LiveKit Docs | livekit/livekit_egress.twirp.go
elegant-coat-98622
07/09/2025, 1:00 PM
tall-airline-16756
07/09/2025, 1:01 PM
tall-airline-16756
07/09/2025, 1:01 PM
elegant-coat-98622
07/09/2025, 1:32 PM