# cloud
m
hey @red-country-59682, we are investigating this right now
r
Thanks @magnificent-art-43333! Can we expect updates on this channel?
m
yeah, we’ll share an update if we discover an issue — so far we’re not seeing anything
we did have an incident about two hours ago; we'll share a longer message about it tomorrow
r
Ok, I'm pretty sure this is related to what we're observing now. Unfortunately, it looks like our control user was/is not able to reconnect. We'll have to verify during the next break (a manual restart is currently not an option because a live event is happening as we speak).
d
they are still unable to reconnect?
besides the incident earlier, there are no other known operational issues right now
r
Well, there is a control user that joined via the go-sdk. According to the dashboard (sessions/RM_JMQgSNH9ppZ3/participants/PA_a93nY4PMmuHZ) it was able to resume the connection, but in reality (as seen from the js-sdk) the participant is no longer listed in the room.
If you're no longer seeing any issues on your side, I'm sure the problem can be solved by restarting our backend (and thus forcing a real reconnect), but for "live reasons" that's currently not possible 🫣
d
I see. If the user having connection problems reconnects, things should work as intended. But I understand if it's not simple to trigger that in the middle of a session.
we'll be sharing a longer message about the incident tomorrow. my apologies for the disruption. we will do everything we can to stabilize things asap.
r
I will try a forced reconnect during the next break and report back. Thanks for your quick support and don't sweat it, s..tuff happens – and sometimes it's bound to happen during live events 😅
❤️ 2
m
Appreciate you saying that, Luis. We really do aim for this to never happen, so two back-to-back incidents in two days is very upsetting/disappointing.
r
Yeah, I can sympathize with you on that – and I trust that you'll get back on track. Fortunately, we have backup systems in place because we also don't want our customers to experience disruptions. Our backup systems may fail too, of course, but let's just hope they don't fail at exactly the same time as yours.
🙏 2
As expected, a forced reconnect fixed everything. Thanks again, and I'm looking forward to the postmortem tomorrow.
👍 1
It looks like we had a similar disconnect issue around 40 minutes ago. Is it possible that you're having another outage? The status page shows nothing so far, but it's sometimes not updated in real time.
b
I had the same situation happen, @red-country-59682, where I have a Go server as the state manager. I'm thinking a backup State Manager might be useful for this specific case, allowing the State Manager to reboot without losing its state.
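(Rough illustration of the idea above: snapshot the state manager's session state to durable storage so a rebooted instance can restore it. The `StateManager` type, the JSON file, and all field names here are hypothetical, not part of any SDK; a real setup would more likely use Redis or a database.)

```go
package main

import (
	"encoding/json"
	"os"
)

// ParticipantState is a hypothetical snapshot of one tracked participant.
type ParticipantState struct {
	Identity string `json:"identity"`
	RoomID   string `json:"room_id"`
}

// StateManager holds the in-memory state we don't want to lose on reboot.
type StateManager struct {
	Participants map[string]ParticipantState `json:"participants"`
}

// Save writes the current state to disk so a restarted process can pick it up.
func (s *StateManager) Save(path string) error {
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// Load restores the last snapshot after a reboot.
func Load(path string) (*StateManager, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var s StateManager
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	sm := &StateManager{Participants: map[string]ParticipantState{
		"control-user": {Identity: "control-user", RoomID: "RM_JMQgSNH9ppZ3"},
	}}
	if err := sm.Save("state.json"); err != nil {
		panic(err)
	}
	restored, err := Load("state.json")
	if err != nil {
		panic(err)
	}
	_ = restored // the rebooted manager continues from the last snapshot
}
```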
d
@red-country-59682 we saw a higher error rate on a couple of servers earlier today. We are investigating the root cause now.
🙏 1
r
@dry-elephant-14928 I just read the postmortem, and neither it nor anything else on the status page mentions the higher error rates you brought up. Can you tell me more about that?
m
Hey @red-country-59682, I’m sure David will share a bit more on that when he has a moment, but just for clarity, the postmortem covers specifically the global and regional outages.
d
@red-country-59682 sorry for the delay in getting back to you. As Russ mentioned, because it was an isolated server locking up, it was not considered a widespread outage. The root cause turned out to be a Go channel in the signal connection handler whose buffer was too small for the message rate on that server. The handler was overwhelmed and some messages were dropped as a result. We have fixed that issue and all of our servers have been updated. 😮‍💨
🙏 1
❤️ 1
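(For anyone curious, a minimal sketch of the failure mode described above: a small buffered channel combined with a non-blocking send drops messages once the consumer falls behind. This is illustrative only, not the actual server code; the channel size, message type, and timings are made up.)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Too small a buffer for the incoming message rate.
	signals := make(chan string, 4)

	// Slow consumer, standing in for the signal connection handler.
	go func() {
		for msg := range signals {
			time.Sleep(10 * time.Millisecond)
			_ = msg
		}
	}()

	// Producer uses a non-blocking send, so once the buffer is full,
	// messages are silently dropped.
	dropped := 0
	for i := 0; i < 200; i++ {
		select {
		case signals <- fmt.Sprintf("signal-%d", i):
		default:
			dropped++ // buffer full: this message is lost
		}
	}
	close(signals)

	// A larger buffer, or backpressure instead of a drop, avoids the loss.
	fmt.Println("dropped messages:", dropped)
}
```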