# ask-ai
Hamza:
I'm running livekit/livekit-server 1.9.0 and livekit/egress 1.9.0 on EKS. Each egress pod has `max_cpu: 1000` and the default `cpu_cost` table (roomComposite 600, track 200). My backend fires `StartRoomCompositeEgress` the moment it receives the `room_started` webhook. Under load I see `livekit_node_twirp_request_status{method="StartRoomCompositeEgress",code="unavailable"}` spike and the SDK throws `twirp error unknown: no response from servers`. How can I guarantee the first room to request recording always gets the next available egress worker, and what Prometheus or Redis pattern do you recommend to prevent duplicate start attempts? Include code-level pointers (package/file) if relevant.
AI:
Thanks for reaching out, Hamza. I'm an AI assistant and I'm here to help. Give me a moment while I research this topic.

LiveKit egress uses a Monitor component to track CPU and memory usage and decide whether an incoming StartRoomCompositeEgress request can be accepted, based on the configured CPU cost values (e.g., roomComposite = 600) and the per-pod `max_cpu` limit. The service prevents duplicate start attempts by checking whether an egress ID already exists before starting processing, returning an error if so (see the egress service code around the `AlreadyExists` check that returns `ErrEgressAlreadyExists`). Prometheus metrics like `livekit_egress_available` help track resource availability, but scheduling is best-effort and based on current resource usage, so there is no built-in guarantee that the first request gets the next available worker; tuning CPU costs and setting `max_cpu` accordingly is critical. As for duplicate prevention, the service already enforces uniqueness by egress ID internally; using Redis or extra Prometheus patterns for deduplication is not documented as a recommended practice within LiveKit. The relevant code is in the egress service's request admission control and resource tracking (notably the Monitor and CPUCostConfig packages). If you want a stronger guarantee of first-request priority, consider integrating your backend with the LiveKit API to track and queue requests before calling StartRoomCompositeEgress, using external coordination. Sources: README.md | Self-hosting the Egress Service | LiveKit Docs
Hamza:
"If you want a stronger guarantee of first-request priority, consider integrating your backend with the LiveKit API to track and queue" - give me a sample: what to query and so on. Also, I intend to scale 0 -> N egress pods to maximize cost savings, using warm pool instances for the best UX. Should I manually query `livekit_egress_available` and implement a FIFO Redis pattern, or can I achieve the same with something LiveKit offers out of the box?
AI:
Hey there! Let me break down how LiveKit handles egress scaling. LiveKit exposes a Prometheus metric called `livekit_egress_available` that helps with autoscaling by showing how much capacity each egress instance has left. Behind the scenes, egress requests are distributed over Redis, and different egress pods pick up requests based on their current workload; the system can scale from zero to multiple pods. However, there's a catch: there isn't an external API that lets you peek into the egress request queue or manage a first-in-first-out (FIFO) queue directly. You can use the `ListEgress` API endpoint to get a snapshot of active egress instances, but keep in mind it's more for monitoring than precise scheduling. If you want to guarantee your requests are processed in order, you'll need a custom solution: track your requests in something like Redis and check the `livekit_egress_available` metric before calling `StartRoomCompositeEgress`. Since LiveKit doesn't offer an out-of-the-box solution for this orchestration, rolling your own backend queue that watches egress CPU availability is your best bet. Here's a quick pseudocode example:

```go
// Pseudocode
// 1. Push incoming room recording requests into a Redis FIFO queue
// 2. Periodically poll the Prometheus metric `livekit_egress_available`
// 3. If available slots > 0, pop a request from the queue and call StartRoomCompositeEgress
```

If you dig into the LiveKit codebase, you'll find the relevant bits around Redis-based egress request distribution, the Prometheus metrics integration, and the ListEgress API. Hope this helps! Let me know if you have any questions. Sources: Self-hosting the Egress Service | LiveKit Docs | Universal Egress | Egress API | LiveKit Docs
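Fleshed out a bit, that loop might look like the rough sketch below. Assumptions not taken from LiveKit: a Redis list named `egress:pending` as the FIFO, the Prometheus HTTP API client for reading the gauge, and a caller-supplied `start` function (for example the SetNX-guarded helper sketched earlier).

```go
// Rough sketch: drain a Redis FIFO of pending room names whenever the
// livekit_egress_available gauge reports free capacity.
// Assumes github.com/redis/go-redis/v9 and github.com/prometheus/client_golang;
// the queue name "egress:pending" is illustrative.
package main

import (
	"context"
	"time"

	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
	"github.com/redis/go-redis/v9"
)

// availableSlots sums livekit_egress_available across pods. `or vector(0)`
// makes the query return 0 instead of an empty result when no egress pod is
// running yet, which matters when scaling from zero.
func availableSlots(ctx context.Context, prom promv1.API) (int, error) {
	val, _, err := prom.Query(ctx, `sum(livekit_egress_available) or vector(0)`, time.Now())
	if err != nil {
		return 0, err
	}
	if vec, ok := val.(model.Vector); ok && len(vec) > 0 {
		return int(vec[0].Value), nil
	}
	return 0, nil
}

// drainQueue pops rooms in FIFO order and hands them to start (e.g. the
// SetNX-guarded helper sketched earlier), requeueing on failure.
func drainQueue(ctx context.Context, rdb *redis.Client, prom promv1.API,
	start func(context.Context, string) error) {
	for ctx.Err() == nil {
		slots, err := availableSlots(ctx, prom)
		if err != nil || slots <= 0 {
			time.Sleep(2 * time.Second) // back off until capacity appears
			continue
		}
		res, err := rdb.BLPop(ctx, 2*time.Second, "egress:pending").Result()
		if err != nil {
			continue // timeout (queue empty) or transient Redis error
		}
		room := res[1] // res[0] is the key name
		if err := start(ctx, room); err != nil {
			rdb.LPush(ctx, "egress:pending", room) // put it back, retry later
		}
	}
}
```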
Hamza:
Great. The catch I see here is that `livekit_egress_available` will be null until we actually have an egress pod running, since the metric wouldn't be scraped, right? So basically we need to serialize requests to avoid race conditions, i.e. for the FIFO thing.
AI:
Hey there! I totally get what you're dealing with. Here's the scoop on the `livekit_egress_available` metric: it'll show up as null if no egress pods are running, since there's nothing for Prometheus to scrape. What this means for you is that your backend needs to handle serialization and queuing of requests itself, to avoid those tricky race conditions when scaling up from zero. Think of it like making sure everyone gets served in the right order at a coffee shop! Your best bet is to have your backend watch scaling events, or even proactively spin up pods before sending out egress start requests, and then do a quick check on `livekit_egress_available` once the pods are up and running. That way everything processes in order and you use your resources efficiently.
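One way to get that serialization, sketched under the assumption of a single Redis instance (the lock key and TTLs below are made up for illustration): let exactly one backend replica run the drain loop at a time, so "check capacity, then start" can never interleave across replicas.

```go
// Sketch: elect a single scheduler via a Redis SET NX lease so the
// check-capacity-then-start sequence is serialized across backend replicas.
// Key name and TTLs are illustrative; assumes github.com/redis/go-redis/v9.
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

func runSchedulerIfLeader(ctx context.Context, rdb *redis.Client, drain func(context.Context)) {
	const lockKey = "egress:scheduler:leader"
	for ctx.Err() == nil {
		// Only one replica wins the lease; the TTL lets another take over on crash.
		ok, err := rdb.SetNX(ctx, lockKey, "me", 15*time.Second).Result()
		if err != nil || !ok {
			time.Sleep(5 * time.Second)
			continue
		}
		leaseCtx, cancel := context.WithCancel(ctx)
		go func() { // keep the lease alive while we hold leadership
			t := time.NewTicker(5 * time.Second)
			defer t.Stop()
			for {
				select {
				case <-leaseCtx.Done():
					return
				case <-t.C:
					rdb.Expire(leaseCtx, lockKey, 15*time.Second)
				}
			}
		}()
		drain(leaseCtx) // e.g. the drainQueue loop sketched above
		cancel()
	}
}
```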
Hamza:
Is there no Redis or livekit-server event (don't quote me, something like the LiveKit total-rooms one) that could offer the same? From what I understand, instead of querying `livekit_egress_available`, I could just rely on the `not_found` or `unavailable` twirp error to decide when to manually increase the egress pod replica count, correct? (Of course while serializing the acquire step.)
AI:
Hey there! I understand you're looking for a clear way to handle egress scaling in LiveKit. Right now, the knowledge base doesn't show a built-in method for triggering egress pod scaling off specific events. While you shouldn't rely directly on Twirp errors to scale your pods, you do have good metrics to work with: `livekit_egress_available` is well suited to autoscaling decisions. Using Twirp errors like `not_found` or `unavailable` as direct scaling triggers could introduce tricky timing issues - an error tells you a request has already failed, whereas a metric tells you capacity is running out before requests fail. My recommendation: focus on proactive signals like CPU usage and the `livekit_egress_available` gauge, combined with a backend request-queuing strategy, for the most reliable scaling approach. Think of it as a traffic controller managing your resources. If you really want a backup plan, you could add custom backend logic to manually trigger scale-ups; just keep in mind that would be your own special sauce, not something built into LiveKit. Hope this helps! Feel free to reach out if you need more guidance. Sources: Self-hosting the Egress Service | LiveKit Docs | livekit/livekit_egress.twirp.go
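If you still want to react to those errors defensively, treat them as a requeue signal rather than a scaling trigger. A sketch, under the assumption that the SDK's errors unwrap to `twirp.Error` (worth verifying against your SDK version):

```go
// Sketch: treat twirp "unavailable"/"resource_exhausted" as "requeue and retry
// later", not as a scaling trigger. Assumes github.com/twitchtv/twirp and that
// LiveKit SDK errors satisfy twirp.Error (verify for your SDK version).
package main

import (
	"errors"

	"github.com/twitchtv/twirp"
)

// shouldRequeue reports whether a StartRoomCompositeEgress failure looks like
// a transient capacity problem rather than a permanent error.
func shouldRequeue(err error) bool {
	var terr twirp.Error
	if errors.As(err, &terr) {
		switch terr.Code() {
		case twirp.Unavailable, twirp.ResourceExhausted:
			return true // no worker free right now: put the request back
		case twirp.NotFound, twirp.AlreadyExists:
			return false // don't retry: wrong room or duplicate egress ID
		}
	}
	return false
}
```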
Hamza:
Alright, so there are no Redis queue, egress queue depth, etc. events we can tap into for this? My complaint is: why does the StartEgress RPC have to fail? Can't it just register the request and let the infra match the requirement later? Or is there some wisdom behind it?
AI:
Hey there! šŸ‘‹ I totally get the frustration with how the StartRoomCompositeEgress RPC handles capacity limits right now: it fails immediately instead of giving you breathing room. The current setup is designed to give instant feedback about resource availability. The egress Monitor manages capacity behind the scenes, but the RPC only accepts a request if the resources line up at that moment, and there are no built-in event streams or Redis queue-depth signals you can tap into to work around this. That means your backend needs its own queuing and retry strategy to absorb these capacity constraints. It could be worth suggesting asynchronous egress scheduling as an improvement to the LiveKit team; for now, build robust request management on your end so these moments are handled gracefully. 😊 Sources: egress/errors.go | pkg/stats/monitor.go
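As an interim pattern, a small bounded-backoff wrapper can approximate "register now, fulfill when capacity appears" on the client side. This is a generic sketch, nothing LiveKit-specific:

```go
// Sketch: bounded exponential backoff around a start attempt, for the
// "register and retry later" behavior the RPC doesn't provide natively.
package main

import (
	"context"
	"time"
)

func withBackoff(ctx context.Context, attempts int, f func(context.Context) error) error {
	delay := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = f(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2 // double the wait between attempts
		}
	}
	return err // last error after exhausting attempts
}
```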
I don't have the answer you're looking for. You could also try asking your question:
• in one of the other Slack channels, or
• to https://deepwiki.com/livekit/livekit_composite, which is trained on all LiveKit source code.
If you find the answer, please post it here to help others!
Hamza:
Asynchronous egress scheduling: are there any future plans or roadmap discussions on this?