Saturday afternoon (in the US), ZipCar's website and phones went down for a couple hours. For people who were, y'know, relying on them for transportation, this was a very bad time. On top of that, they said very little during the downtime, leaving a bunch of customers wondering "WTF IS HAPPENING OVER THERE?!"
So, without further ado, let me present:
Downtime Anti-Patterns, courtesy of ZipCar
1.) Make a service that people rely on every day for something fundamental, like getting from point A to point B.
2.) Have your web site's DNS entry on a 5-minute TTL, but don't do any kind of DNS failover.
3.) Put your VOIP phone system in the same datacenter as your web servers. That way, when there's a DC-wide outage, no one can get in touch!
4.) Market to a hip, connected, social-network-using market and then, when the shit hits the fan, don't update them for an hour or so. They'll understand.
5.) Don't have a public "status" page. People don't need to know what's going on with your service.
(I strongly disagree with the conclusion that anyone should be fired. But look how upset people are, and rightly so—they have no idea what's going on!)
6.) Apologize publicly but don't make a public postmortem explaining why this happened and what steps have been taken to fix the problem.
What Should Have Happened
First off, let me say that I respect the ZipCar IT team, and any team that has to put out server-related fires. It's a tough job that can mean seriously long hours, lost weekends, dropping out of a family gathering to answer an urgent page, and more. Keeping information systems running and available to the world 24/7/365.25 is **NEVER** an easy feat.
That said, here's what I would have done differently:
1.) Give the IT team access to the company's social network accounts, give them tools to post status updates quickly, and teach them to update customers as soon as a problem's detected.
2.) Use an automated system to point DNS entries to a "sorry, we're down, please see http://status.zipcar.com" page running on a commodity VPS in a completely different datacenter. Provide useful information to the customer RIGHT AWAY, and don't leave them wondering why the page isn't loading.
3.) Have a status.zipcar.com already built.
4.) Do a PUBLIC postmortem analysis on the problem: What happened? How did we hear about it? Why didn't we know sooner? What surprised us? What are we doing to make sure this doesn't happen again?
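To make point 2 a bit more concrete, here's a rough sketch of the decision logic such an automated system might use. This is my own illustration, not anything ZipCar runs: the IP addresses, the `FailoverState` class, the function names, and the three-failure threshold are all made up for the example, and the actual DNS update is left out because it depends entirely on your DNS provider's API.

```python
import urllib.request
import urllib.error
from dataclasses import dataclass

# Hypothetical addresses from the documentation ranges, not a real setup.
PRIMARY_IP = "192.0.2.10"      # main site in the primary datacenter
FALLBACK_IP = "198.51.100.20"  # "sorry, we're down" page on a VPS elsewhere


def site_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, DNS failures and timeouts.
        return False


@dataclass
class FailoverState:
    consecutive_failures: int = 0
    current_target: str = PRIMARY_IP


def next_state(state: FailoverState, healthy: bool, threshold: int = 3) -> FailoverState:
    """Decide where DNS should point after one more health check.

    Fail over only after `threshold` consecutive failures, so a single
    dropped request does not flip the record. Fail back on the first success.
    """
    if healthy:
        return FailoverState(0, PRIMARY_IP)
    failures = state.consecutive_failures + 1
    target = FALLBACK_IP if failures >= threshold else state.current_target
    return FailoverState(failures, target)
```

A small loop would call `site_is_up` every minute or so, feed the result into `next_state`, and push a DNS change whenever `current_target` differs from the previous value. With checks a minute apart and a 5-minute TTL, most visitors should land on the "sorry" page within roughly eight minutes of the outage starting. Failing back on the first success is the simplest choice; a real setup might want a few good checks in a row to avoid flapping.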
Look at those 4 points — I'm not criticizing the technical ability of the team and their ability to build redundant systems. What's in view here is their public reply, or rather, the lack thereof, which obviously and publicly left a number of their customers upset.
I understand that ZipCar's a large company, that they have media and press people, social network consultants, policies and standards for all public communication. Are they really so large that one person can't whip out their iPhone and post a quick tweet within 5 minutes of a system fault?
Here's my maxim:
Customers can forgive and forget downtime, but communication misfires will always be remembered.
Hey, remember when Netflix was going to stop sending out DVDs and change their prices? Yeah, that went over well.
But what does this mean for a small SaaS company?
Most of us don't have as many customers as ZipCar. Heck, we wouldn't know what to DO with that many customers, except maybe swim in pools of money like Scrooge McDuck. There's an unstated flipside to antipattern #1: build a service where downtime doesn't ruin a person's day.
When ZipCar goes down, people can't find or reserve cars. Appointments get missed. People get very upset.
The same happens for an e-commerce service: FAILING to take money does not make happy customers.
Point-of-sale systems? Your client's customers can't buy things, which makes them all very unhappy.
Medical services? Air traffic control? People are gonna die.
But what about when Basecamp goes down? Oh no, I can't manage my project right now, oh well… better get back to work. I can sort it out later.
Freckle? Harvest? Uh oh, I can't track my hours. I'll just write them down on a piece of paper and enter them later.
When a service like that goes down, it's at most an inconvenience. That doesn't exempt any of those companies from full and immediate communication when there is a problem. However, no one's day is completely ruined by timesheet problems.
My point is this: if you're a small company, think about what happens when your customer can't get to your service. Empathize, put yourselves in the customer's suede loafers, and think — man, that site's down, so now is my day totally ruined?
Just starting out? DON'T make a service that's going to need impeccable uptime. Actively avoid it, and mercilessly kill features that take steps in that direction.
Recognize that you don't have the resources to make it happen — people to work on servers, cash for redundant servers in multiple datacenters, etcetera, etcetera. Uptime is expensive.
Remember the key lesson here: customers can forgive downtime. Be honest and humble when you screw up, and don't pretend to be a massive company. In fact, given ZipCar's example, do the complete opposite of what big companies do. Learn from your mistakes, and share what you learned with your customers. At the end of the day, they're the people who make or break your business.