Failure Review

From: Troy Ronda <ronda_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Thu, 29 Sep 2005 09:33:45 -0400

Why do Internet services fail, and what can be done about it?

Review: By Troy Ronda

 

Global-scale services require 24x7 operation and are designed with
techniques to achieve high availability, yet failures still occur. The
services studied in this paper all use redundancy to hide failures from
the end-user, which is effective against hardware and network failures.
Operator failures are much harder to hide; they often occur at the
front-end nodes and are usually configuration errors. Hardware and
software failures outnumber operator failures, but operator failures are
the ones the end-user sees most often. Services can reduce failures
through correctness testing, redundancy, fault injection, load testing,
configuration checking, component isolation, proactive restart, and
exposing/monitoring failures. Online testing would have helped mitigate
more than half the service failures observed at one of the sites. The
techniques of additional redundancy and better exposure/monitoring of
failures give the most significant "bang for the buck." This calls for a
unified problem-tracking/bug database. Operators must be given the right
tools to understand the state of the system; that is, operators must be
treated as first-class users. Configuration tools must also perform
correctness checks on operator input (a minimal sketch of such a check
follows below). The study is based on three sites: Online, Content, and
ReadMostly.
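
As a concrete illustration of the configuration-checking point above,
here is a minimal Python sketch of how an operator tool might validate a
front-end configuration before it is applied. The field names
("backends", "listen_port") and the specific checks are my own
assumptions for illustration, not something described in the paper.

    # Hypothetical sketch: validate operator-supplied front-end settings
    # before they are pushed to production. Field names and checks are
    # assumptions for illustration only.
    import ipaddress

    def check_frontend_config(cfg):
        """Return a list of problems; an empty list means the config passes."""
        problems = []
        # Every front-end must forward requests to at least one backend.
        if not cfg.get("backends"):
            problems.append("no backend servers listed")
        # Reject addresses that do not parse, a typical typo-style operator error.
        for addr in cfg.get("backends", []):
            try:
                ipaddress.ip_address(addr)
            except ValueError:
                problems.append("bad backend address: %r" % addr)
        # The listening port must be in the valid range.
        port = cfg.get("listen_port", 0)
        if not 1 <= port <= 65535:
            problems.append("listen_port out of range: %r" % port)
        return problems

    if __name__ == "__main__":
        candidate = {"listen_port": 80, "backends": ["10.0.0.5", "10.0.0.x"]}
        for problem in check_frontend_config(candidate):
            print("REJECTED:", problem)  # operator sees the error before deployment

Even trivial checks like these would stop malformed input before
deployment, which matches the observation that operator failures are
often front-end configuration errors.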

 

Studies like the one in this paper are important: the first step to
solving problems is to understand their cause. The examination of the
three sites seems thorough given the data available. Failures probably
cannot be prevented entirely. Recognizing that, to the end-user, a
masked failure is just as good as no failure at all is an interesting
insight. Comparing highly preventable incidents, such as operator error,
with less preventable ones, such as hardware failure, is a useful
framing. The evidence that operator failures have a higher time to
repair (TTR) leads directly to the conclusion that better tracking tools
will increase availability: with better problem and solution histories,
operators should be able to repair faster. Intuition also lends credence
to the conclusion that better testing would mitigate many of the
failures. The idea of building online testing and fault injection into
the API of emerging services is a good conclusion from the study (a
rough sketch follows below). It assumes that one can anticipate many of
the failure situations as test cases, which seems fair given a correct
underlying framework, operator tools that do not themselves cause
failures, and an API designed with testing in mind.
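
To make the online-testing and fault-injection suggestion concrete, the
sketch below shows one way a service API could expose a fault-injection
hook in Python. The inject_fault decorator, the ONLINE_TEST environment
flag, and the lookup_user example are all hypothetical; the paper does
not prescribe any particular interface.

    # Hypothetical sketch of a fault-injection hook built into a service API.
    # Names, parameters, and the environment flag are assumptions.
    import functools
    import os
    import random

    def inject_fault(exception, probability):
        """Wrap an API call so it can be made to fail during online testing."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # Faults are injected only when the operator explicitly
                # enables online-testing mode via an environment flag.
                if os.environ.get("ONLINE_TEST") == "1" and random.random() < probability:
                    raise exception("injected fault in %s" % func.__name__)
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @inject_fault(TimeoutError, probability=0.05)
    def lookup_user(user_id):
        # Stand-in for a real backend call.
        return {"id": user_id, "name": "example"}

    if __name__ == "__main__":
        os.environ["ONLINE_TEST"] = "1"
        failures = 0
        for _ in range(1000):
            try:
                lookup_user(42)
            except TimeoutError:
                failures += 1
        print("injected failures observed:", failures)  # roughly 5% of calls

Exercising the failure path like this, in production but under operator
control, would let a site confirm that its redundancy really does mask
the fault before the fault occurs for real.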

 

A discussion of how the bug-history database could lead to better test
strategies would have been appropriate. Using only three sites obviously
raises questions about how representative the study is. These sites
probably are representative; however, the authors did not make a strong
case to me that other sites do not use better tracking and reporting
tools. I would also like some discussion of whether operator failures
arise mostly from upgrades or from fixing hardware failures. We are also
forced to trust the authors' methodology because of the poor structure
of the sites' tracking systems (one can see this comment as a "chicken
and egg" problem). I do like the idea of a public failure repository;
however, since this seems highly unrealistic short of a major web crash,
the authors might want to discuss more likely alternatives (a sketch of
what a shared failure record might look like follows below). It would
also be good to hear about better alternatives to custom-written
front-end software.
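
On the repository point, even a very simple shared record format would
be a start. The sketch below shows the kind of entry I have in mind; the
field names are invented for illustration and are not a proposed
standard.

    # Hypothetical entry in a shared failure repository; fields are invented.
    import json

    failure_report = {
        "service_type": "read-mostly",      # anonymized service category
        "component": "front-end",           # where the failure surfaced
        "root_cause": "operator configuration error",
        "user_visible": True,
        "time_to_repair_minutes": 95,
        "mitigations_that_would_have_helped": ["online testing",
                                               "configuration checking"],
    }

    print(json.dumps(failure_report, indent=2))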

 