# cfml-general
d
Can-o-worms #643.2. How do you folks handle mostly-regularly-scheduled downtime for system updates? Do you...
• Announce them via bulk emails to users (who probably don't read them)
• Have code to announce them on the login screen (where they're probably ignored) according to your update schedule
• Manually announce them on the login screen
• Update a second system offline, and more-or-less-hot-swap it into production
• None of the above
• What are system updates?

ADDENDUM: The recent sticky wicket is SQL Server updates, one of which was still running after 2 1/2 hours. ColdFusion, Java, and Windows updates don't take nearly that long.
r
Bulk email to users.
d
Do you think anyone reads them?
s
My bank always has banners up for a day or two before a downtime, on the login and logged-in landing pages.
Never get an email about it.
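A minimal sketch of the "banner a day or two ahead" idea, in case it helps anyone picture the login-screen option. Everything here is hypothetical: the function name `maintenance_banner`, the two-day lead time, and the message wording are assumptions for illustration, not anyone's actual code.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: start showing the banner two days before the window opens.
BANNER_LEAD = timedelta(days=2)


def maintenance_banner(now: datetime, start: datetime, end: datetime,
                       lead: timedelta = BANNER_LEAD) -> Optional[str]:
    """Return banner text for the login page, or None if no banner is due."""
    if now >= end:
        return None  # the window is over, nothing to announce
    if now >= start:
        return f"Maintenance in progress until about {end:%a %b %d, %I:%M %p}."
    if now >= start - lead:
        return (f"Scheduled maintenance: {start:%a %b %d, %I:%M %p} "
                f"to {end:%I:%M %p}.")
    return None  # too early to announce
```

The login and landing pages would call this on each render and show the text when it is not `None`, so the banner appears and disappears on its own once the window is entered in the schedule.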
r
Yes, we actually have a decent read rate on most of our emails.
c
Our normal business hours end at 5pm, so I typically do Windows Updates, ColdFusion updates, and the like starting around 5:30pm on Fridays. We don't notify users, and it hasn't been an issue. When we are doing major upgrades (like upgrading OS, new versions of CF or SQL Server or our GIS software), we notify via email a few days ahead of the upgrade with the planned outage time window, and follow up when work is completed. Those kinds of upgrades we do on our testing tier first to identify and address any issues, and then address the issues in production before doing the upgrade there.
d
Yes definitely for cf/java we always do dev servers, run for a week or two there, keeping an eye open there, and also here. We do major new CF versions on roughly the same cadence as new servers, so new CF comes on a whole new box that's been running for a fair while in dev, and new db servers too, also on new boxes. So the only go-time tasks are (off the top of my head) copying production db latest changes to the new db boxes, and changing of the network pointers so user-world plugs into the new CF servers. Far as email read rate goes, we see activity plowing right through the announced downtime windows, right up until shutdown, and in the cracks between successive restarts. These are busy healthcare and social services professionals, who can't stop working just because IT says we're going down, and they take our advice as our 'druthers, not gospel.
q
For engine/platform updates, most of our stuff runs in containers these days, so updates rarely have much impact if we do them right. For SQL updates and for migrations where we have to shut down services (such as moving something into containers), we point the site to a static version of itself, with a disabled login page that says the site is under maintenance. That can be done on our firewall/load balancer or, in the worst case, by pointing DNS to a 3rd-party site.
For highly active sites, we also use a service status page that lists outages and planned maintenance. People tend to find those quickly.
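For anyone who wants to see the shape of a maintenance stand-in, here is a rough sketch. It is not the setup described above, which serves a static copy of the site from the firewall/load balancer. It is a tiny WSGI app that a load balancer could be pointed at instead, answering every request with one static page. The name `maintenance_app`, the HTML, and the one-hour `Retry-After` value are all assumptions.

```python
# Stand-in backend for a maintenance window: every URL gets the same page.
MAINTENANCE_HTML = b"""<!doctype html>
<title>Site under maintenance</title>
<h1>Site under maintenance</h1>
<p>Login is disabled while we apply updates. Please check back shortly.</p>
"""


def maintenance_app(environ, start_response):
    """WSGI app that ignores the request path and returns the static page."""
    start_response("503 Service Unavailable", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(MAINTENANCE_HTML))),
        ("Retry-After", "3600"),       # assumed hint: try again in an hour
        ("Cache-Control", "no-store"),  # keep proxies from caching the outage
    ])
    return [MAINTENANCE_HTML]
```

To try it locally, pass `maintenance_app` to `wsgiref.simple_server.make_server` from the standard library. The 503 status with `Retry-After` is a choice made for this sketch so that crawlers and monitoring treat the outage as temporary.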
e
Many of the systems do not allow downtime, so there is a lot of hot patching and use of hot-swap systems to keep everything up. As for scheduled maintenance on systems that are allowed downtime, it's a standard email to affected parties a week prior, with a large buffer in the window to allow for a complete and total screw-up requiring a reload/reimage from backup. The second email goes out a few days before the systems go down, and the third a few hours before the maintenance. Then there's an email at the start of the maintenance to internal departments that need to know, such as the helpdesk, and an email afterward to everyone stating the system is back online. As for the work performed, many routine upgrades or patches are applied, tested, cloned as an image, and deployed; we rarely have to touch physical servers beyond installation or replacement.
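The five-email sequence above can be sketched as a small schedule calculator. This is illustrative only: `notification_plan` and the labels are hypothetical names, "a week prior" comes from the message, and 3 days and 3 hours are assumed stand-ins for "a few days" and "a few hours".

```python
from datetime import datetime, timedelta

# Offsets before the maintenance start. The 3-day and 3-hour values are
# assumptions standing in for "a few days" and "a few hours".
OFFSETS = [
    ("first notice to affected parties", timedelta(days=7)),
    ("reminder to affected parties", timedelta(days=3)),
    ("final reminder to affected parties", timedelta(hours=3)),
    ("maintenance starting, internal teams such as the helpdesk", timedelta(0)),
]


def notification_plan(start: datetime, end: datetime):
    """Return (send_time, label) pairs in chronological order."""
    plan = [(start - offset, label) for label, offset in OFFSETS]
    # The all-clear goes out at the planned end here; in practice it would
    # be sent whenever the work actually finishes.
    plan.append((end, "back online, everyone"))
    return sorted(plan)
```

Calling `notification_plan(datetime(2024, 6, 14, 18, 0), datetime(2024, 6, 14, 22, 0))` gives five entries, the first dated June 7 at 6pm and the last at the planned end of the window.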