# cloud
m
hey @red-country-59682, we are investigating this right now
r
Thanks @magnificent-art-43333! Can we expect updates on this channel?
m
yeah, we’ll share an update if we discover an issue — so far we’re not seeing anything
we did have an incident about two hours ago; we'll share a longer message about it tomorrow
r
Ok, I'm pretty sure this is related to what we're observing now. Unfortunately, it looks like our control user was/is not able to reconnect. We'll have to verify during the next break (a manual restart is currently not an option because a live event is happening as we speak).
d
they are still unable to reconnect?
besides the incident earlier, there are no other known operational issues right now
r
Well, there is a control user that joined via the go-sdk. According to the dashboard (sessions/RM_JMQgSNH9ppZ3/participants/PA_a93nY4PMmuHZ) it was able to resume the connection, but in reality (as seen from the js-sdk) the participant is no longer listed in the room.
If you're no longer seeing any issues on your side, I'm sure the problem can be solved by restarting our backend (and thus forcing a real reconnect), but for "live reasons" that's currently not possible 🫣
d
I see. If the user having connection problems reconnects, things should work as intended. But I understand if it's not simple to trigger that in the middle of a session.
we'll be sharing a longer message about the incident tomorrow. my apologies for the disruption. we will do everything we can to stabilize things asap.
r
I will try a forced reconnect during the next break and report back. Thanks for your quick support and don't sweat it, s..tuff happens – and sometimes it's bound to happen during live events 😅
❤️ 2
m
Appreciate you saying that, Luis. We really do aim for this to never happen, so two back-to-back incidents in two days is very upsetting/disappointing.
r
Yeah, I can sympathize with you on that – and I trust that you'll get back on track. Fortunately, we have backup systems in place because we also don't want our customers to experience disruptions. Our backup systems may fail too, of course, but let's just hope they don't fail at exactly the same time as yours.
🙏 2
As expected, a forced reconnect fixed everything. Thanks again, and I'm looking forward to the postmortem tomorrow.
👍 1
It looks like we had a similar disconnect issue around 40 minutes ago. Is it possible that you're having another outage? The status page shows nothing so far, but it's sometimes not updated in real time.
b
I had the same situation happen, @red-country-59682, where I have a Go server as the state manager. I'm thinking a backup State Manager might be useful for this specific case, allowing the State Manager to reboot without losing its state.
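(Rough illustration of the idea above: snapshot the state manager's session state to durable storage so a rebooted instance can restore it. The `StateManager` type, the JSON file, and all field names here are hypothetical, not part of any SDK; a real setup would more likely use Redis or a database.)

```go
package main

import (
	"encoding/json"
	"os"
)

// ParticipantState is a hypothetical snapshot of one tracked participant.
type ParticipantState struct {
	Identity string `json:"identity"`
	RoomID   string `json:"room_id"`
}

// StateManager holds the in-memory state we don't want to lose on reboot.
type StateManager struct {
	Participants map[string]ParticipantState `json:"participants"`
}

// Save writes the current state to disk so a restarted process can pick it up.
func (s *StateManager) Save(path string) error {
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// Load restores the last snapshot after a reboot.
func Load(path string) (*StateManager, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var s StateManager
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	sm := &StateManager{Participants: map[string]ParticipantState{
		"control-user": {Identity: "control-user", RoomID: "RM_JMQgSNH9ppZ3"},
	}}
	if err := sm.Save("state.json"); err != nil {
		panic(err)
	}
	restored, err := Load("state.json")
	if err != nil {
		panic(err)
	}
	_ = restored // the rebooted manager continues from the last snapshot
}
```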
d
@red-country-59682 we saw a higher error rate on a couple of servers earlier today. We are investigating the root cause now.
🙏 1
r
@dry-elephant-14928 I just read the postmortem, and neither it nor anything else on the status page mentions the higher error rates you brought up. Can you tell me more about that?
m
Hey @red-country-59682, I’m sure David will share a bit more on that when he has a moment, but just for clarity, the postmortem covers specifically the global and regional outages.
d
@red-country-59682 sorry for the delay in getting back to you. As Russ mentioned, because it was an isolated server locking up, it was not considered a widespread outage. The root cause turned out to be a Go channel in the signal connection handler whose buffer was too small for the message rate on that server. The handler was overwhelmed and some messages were dropped as a result. We have fixed that issue and all of our servers have been updated. 😮‍💨
🙏 1
❤️ 1
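(For anyone curious, a minimal sketch of the failure mode described above: a small buffered channel combined with a non-blocking send drops messages once the consumer falls behind. This is illustrative only, not the actual server code; the channel size, message type, and timings are made up.)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Too small a buffer for the incoming message rate.
	signals := make(chan string, 4)

	// Slow consumer, standing in for the signal connection handler.
	go func() {
		for msg := range signals {
			time.Sleep(10 * time.Millisecond)
			_ = msg
		}
	}()

	// Producer uses a non-blocking send, so once the buffer is full,
	// messages are silently dropped.
	dropped := 0
	for i := 0; i < 200; i++ {
		select {
		case signals <- fmt.Sprintf("signal-%d", i):
		default:
			dropped++ // buffer full: this message is lost
		}
	}
	close(signals)

	// A larger buffer, or backpressure instead of a drop, avoids the loss.
	fmt.Println("dropped messages:", dropped)
}
```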