Acceptable Downtime

Brad Feld wrote a post yesterday titled “Acceptable Downtime”, in which he explains that he sits on the board of a startup that is considering adding redundancy to its web-based service to mitigate the risk of a catastrophic failure of its systems.

I have found that consumers are much more accepting of downtime than enterprises. When you work with an enterprise, especially something like a telco that itself offers near-100% uptime, the customer automatically expects you to deliver five 9’s (99.999%) of uptime.
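To put numbers on that expectation, here is a minimal sketch in Python that converts a count of 9’s into a yearly downtime budget. The function name and the 365-day year are my own assumptions for illustration, not anything from Brad’s post.

```python
# Assumption: a 365-day year, ignoring leap years.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted by N nines of uptime.

    Hypothetical helper: 2 nines means 99%, 5 nines means 99.999%.
    """
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes/year")
```

Under these assumptions, five 9’s leaves a little over five minutes of downtime for the entire year, while two 9’s leaves more than three and a half days.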

I think the more important part, and the one most often overlooked, is how uptime is calculated. Many companies do not include scheduled downtime for maintenance in their downtime metrics. This is very dishonest in my opinion, but they justify it to themselves somehow.
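A rough sketch of how much that omission can flatter the number. The figures here (a 720-hour month, 1 hour of unplanned outage, 4 hours of scheduled maintenance) are made up for illustration, and the function is hypothetical.

```python
def availability(total_hours: float,
                 unplanned_hours: float,
                 scheduled_hours: float = 0.0,
                 count_scheduled: bool = True) -> float:
    """Return uptime as a percentage of the measurement period.

    When count_scheduled is False, maintenance windows are left out of
    the downtime total, which is the accounting described above.
    """
    down = unplanned_hours + (scheduled_hours if count_scheduled else 0.0)
    return 100.0 * (total_hours - down) / total_hours


# Hypothetical 30-day month: 1 hour unplanned, 4 hours scheduled.
honest = availability(720, 1, 4)
flattering = availability(720, 1, 4, count_scheduled=False)

if __name__ == "__main__":
    print(f"counting maintenance: {honest:.2f}%")
    print(f"ignoring maintenance: {flattering:.2f}%")
```

The service was unavailable for the same five hours either way; only the reported figure changes.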

The problem with developing redundancy is that the capital expense is usually not 50% more than what you’ve already built; it’s more like 200%. Redundancy is incredibly expensive, and I would think that the risk/reward ratio on the expenditure would be very high. Whatever problems result from downtime are, I think, nothing compared to the crippling effect that paying for the kind of redundancy larger companies offer would have on most startups and small companies. If the systems and networks are designed for easy redundancy from the start, it can often be added cheaply, but most of the time I find that redundancy isn’t factored into the initial architecture, especially at small companies and startups.

The cheapest alternative to redundancy is good process. Hire experienced IT operations managers who enforce good change control processes. Implement processes that standardize your architecture on a minimal number of hardware platforms, and processes that require a consistent stock of maintenance spares for every piece of equipment. Implement processes that require backups, and processes that require restoring those backups frequently to verify their integrity. Implement processes that require exacting documentation before anything goes into your production environment, and keep your organization split between operations and implementation. An operations person is generally not good at implementations (too much attention to detail and an inability to get things online and worry about smaller problems later), and a good implementation person is not good at operations (I fit into this bucket… inattention to detail, get it done at all costs attitude, disregard for most processes). Those people belong in their jobs for a reason, and the turf battle, if kept in check, is valuable to the organization’s stability.

The best way to measure downtime in the end is customer satisfaction. If your customers don’t notice or don’t care that you’re down 2% of the time, then that’s acceptable downtime. If they notice 0.001% of downtime, then you have to meet their expectations. It’s important for sales personnel to set proper expectations, and it’s important to focus, at least initially, on more forgiving customers. A good support staff can also mitigate downtime issues. At the end of the day, the number of 9’s of uptime is only one piece of the puzzle that makes up the retention and churn rates of your customer base.