Did you have a server in Linode's Atlanta datacenter? How was the last 48 hours?
Downtime sucks. Your beautiful sites -- those sites that your clients pay for, that make them money -- are only valuable when they're up. Having a server go offline is a painful reminder that "the cloud" and "the internet" are still, in the end, running on a real computer somewhere: an arbitrary, unfair, stupid machine with a perverse desire to crash and ruin your day.
People lost money during this outage. Were your clients' sites down? Did they call to "thank" you for that? Did your budding SaaS go offline for 2 days? Hope it wasn't mission critical. Someone, somewhere on this green earth, woke up on Thursday morning and said, "hey, let's go crash Linode," and because of that? People lost money. People lost sleep. People lost trust.
For my part, I've been in high-intensity server situations, and reading between the lines of the 36-hour saga of the Atlanta downtime, I have every reason to believe that Linode did their best in what was a very bad situation. As people commented on r/sysadmin: this could happen to any cloud or VPS provider. If an attacker with deep pockets decides to take down part of your infrastructure, you're going to have a bad day. Hell, even AWS goes down from time to time.
What if your server going down didn't have to be a five-alarm fire? What if, as a smart business owner, you knew the risks involved in putting your website on the internet, and you knew how to proactively deal with a server going down? You could sleep well at night knowing that, even if your server tragically caught fire, your data was safe. Instead of the awful "floor dropping" gut lurch and worry about your business, what if a critical system crashing could be no big deal?
To spoil the conclusion: if you want to sleep well at night, you need a disaster plan. It doesn't have to be fancy, but there are questions to answer immediately and changes to make now that will undeniably save your bacon when the proverbial crap hits the fan.
Making a Good Plan
At bare minimum, you need to grab a pen and scribble these questions (and their answers) down on a piece of paper:
- Who is going to lose sleep when the site's down?
- Are we taking backups, and are they good enough?
- Do we know how to set up a new server from scratch?
- How much downtime are we OK with?
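On the backups question, "are we taking them?" is easy to answer; "are they good enough?" is where most people get burned. One quick, automatable sanity check is making sure your newest backup isn't stale. Here's a minimal sketch -- the directory layout, file naming, and 26-hour threshold are all assumptions you'd swap for your own setup:

```shell
#!/bin/sh
# check_backup_freshness DIR MAX_AGE_HOURS
# A sketch of a backup-staleness check: prints a status line and returns
# non-zero when backups are missing or older than the allowed age.
# Assumes backups land as files in DIR (e.g. nightly tarballs).
check_backup_freshness() {
    dir=$1
    max_age_hours=$2

    # Newest file in the backup directory, by modification time
    latest=$(ls -t "$dir" 2>/dev/null | head -n 1)
    if [ -z "$latest" ]; then
        echo "CRITICAL: no backups found in $dir"
        return 2
    fi

    now=$(date +%s)
    # GNU stat (-c %Y) on Linux, BSD stat (-f %m) on macOS
    mtime=$(stat -c %Y "$dir/$latest" 2>/dev/null || stat -f %m "$dir/$latest")
    age_hours=$(( (now - mtime) / 3600 ))

    if [ "$age_hours" -gt "$max_age_hours" ]; then
        echo "CRITICAL: newest backup ($latest) is ${age_hours}h old"
        return 2
    fi
    echo "OK: newest backup ($latest) is ${age_hours}h old"
}
```

Drop something like this into cron and have it email or page you on failure -- a backup job that silently stopped running three months ago is worse than no backup at all, because you *think* you're covered. (And a freshness check still isn't a restore test; actually restoring a backup onto a scratch server is the only real proof.)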
Now, these are BIG questions. I could write a couple thousand words on each. If you're thorough, you're going to end up with a LOT more than a page's worth of answers, and several weeks' worth of work writing documentation, testing backups, and putting together a final detailed plan.
But, if you're like most companies I've worked with, you haven't got weeks to do this -- you need a strategy, and you need it now. In fact, did your server go down? You needed it yesterday.
So let me share with you a little play from my book, something I do with all of my clients -- it's an easy worksheet that makes sure that your bases are covered when it comes to server failure. Or, if they're not covered, you'll know where you need to start!