The site's down! Quick! Tell the customers!

Today Zerigo, a major DNS provider, went offline for an extended vacation.

It's going on 10 hours since they've gone down and their status page and Twitter haven't been updated in over two hours.  Does this make me happy as a customer?  (NO, NO IT DOES NOT AT ALL — ed.)

Long-term success comes from making customer relationships the HIGHEST priority.  I'd even place it one peg higher than actually fixing the problem.  Of course the problem needs to be fixed, and right quick!  But unless the problem is truly a "one minute fix," sending information to the customers should be the very first thing that an Ops team does!

"The site's down, I don't know why"

This is what's running through your customer's mind.  All it takes is ONE tweet, one update to the status page: "Hey, we know this is down!  We're on it and will let you know what's up ASAP."  In the absence of that tiny scrap of information customers are left with… vague uncertainty, and the seeds of distrust.

Don't let that distrust grow for even a second—tell the customer NOW what's going on and fix it as soon as possible.  If it takes longer than 10–15 minutes to fix, update customers again so they know what's happening.  Keep updating them until the problem's completely fixed!
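To make that cadence concrete, here's a rough sketch in Python. It isn't our actual tooling: the function names, the 15-minute interval (I picked the top of the 10-15 minute range) and the message wording are all assumptions for illustration. It only works out when customers should hear from you during an outage: once right away, again at every interval while the problem is open, and once more when it's fixed.

```python
from datetime import datetime, timedelta

# Assumed cadence: the upper end of the 10-15 minute window.
UPDATE_INTERVAL = timedelta(minutes=15)


def first_notice(service):
    """Hypothetical wording for the very first status update."""
    return f"Hey, we know {service} is down! We're on it and will let you know what's up ASAP."


def update_is_due(last_update, now, interval=UPDATE_INTERVAL):
    """True once a full interval has passed since customers last heard from us."""
    return now - last_update >= interval


def updates_owed(outage_start, resolved_at, interval=UPDATE_INTERVAL):
    """List the times customers should get an update during one outage."""
    times = [outage_start]          # tell them immediately
    t = outage_start + interval
    while t < resolved_at:          # keep telling them while it's still broken
        times.append(t)
        t += interval
    times.append(resolved_at)       # and tell them when it's fixed
    return times


if __name__ == "__main__":
    start = datetime(2012, 1, 1, 12, 0)
    for when in updates_owed(start, start + timedelta(minutes=40)):
        print(when.strftime("%H:%M"))
```

A 40-minute outage works out to four updates under these assumptions: the first notice, two check-ins, and the all-clear. Actually posting them to a status page or Twitter is deliberately left out, since that depends entirely on the tools you use.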

Communicate Consistently

This is something we try very hard to do at my current workplace.  We have businesses relying on us for their main revenue source, and we necessarily take downtime very seriously.

But we still let the customer know what's happening—almost as soon as we know it ourselves!  This means working together as a team, planning in advance, and having tools and systems in place to make this kind of communication as easy as possible.

"Tell the customer first" is the opposite of usual Ops behavior—when downtime hits, updating the status page is the LAST thing that's usually thought of.  FIRST, put out the fire, then update status and Twitter when the flames die back.  If you were quick enough, maybe no one noticed the downtime and you don't have to update at all!

What's the problem with this approach?  All that firefighting is utterly invisible to the customer.  Yes, the customer is well served by a quick resolution to the problem.

But what if it's not quick?  And what about the 100 people who DID notice during that "one minute" and who aren't going to email you demanding an explanation?  How many will simply go elsewhere?

Empower your Ops

This is why Ops teams should have access to your company's Twitter feed.  This is why you need a status page.  Posting to these things should be as low-friction as possible.  This is why you need a culture of sharing critical information with customers as soon as you're made aware of it.

Customers can forgive downtime, even extended downtime, so long as there's a good explanation, a sincere apology, and steps taken to ensure that it won't happen again.  Communication goes a long way toward keeping goodwill, and I'd rather err on the side of over-communication than leave people scratching their heads, gnashing their teeth, and pounding their keyboards in frustration.