review of failure study

From: Guoli Li <gli_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Wed, 28 Sep 2005 15:37:57 -0400

This paper explores failures in large-scale Internet services and provides
a clear picture of how failures occur and which techniques are useful for
recovering from them. The discussion is based on solid analysis of failure
data gathered from three large-scale Internet services: Online, Content,
and ReadMostly.

The key contributions of this paper are as follows. First, the authors
identify and analyze the causes of failures to find the most failure-prone
components in Internet services. Second, after discussing a set of failure
mitigation techniques that can prevent component failures from becoming
user-visible failures, they present several failure case studies and the
lessons learned from them. Third, the authors suggest providing better
operator interfaces to help operators avoid errors, which are the largest
cause of failures in Internet services. They also argue that a
standardized, industry-wide format for failure-cause data, recording
problems, their impact, and their resolutions, as well as service-level
benchmarks, would be useful.

However, the paper would be better if the following problems were fully
addressed. First, Time to Detection (TTD), the time from a component
failure to the detection of that failure, is a very important metric for
Internet services; reducing TTD could improve a system's fault tolerance.
TTD is not emphasized by the authors. An end-user-visible failure may
consist of several component failures; that is, the failure can be
described by a composite event pattern, and detecting such a failure is
similar to composite event detection in distributed systems. There are two
ways to detect failures. In centralized detection, all failures in the
network are collected by a central server or database, and detection
happens in one place. In distributed detection, detection happens within
the network: local failures are combined as close to the failure sources
as possible, and partial detection results can be shared among failure
patterns. There are tradeoffs between the centralized and distributed
approaches in terms of network traffic, TTD, and resource consumption.
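As a rough illustration of the centralized approach, a composite event
pattern can be matched by a single collector that records every component
failure and declares a user-visible failure once all components in a
pattern have failed within a time window. This is only a minimal sketch;
the class name, pattern names, window, and component identifiers are all
invented for illustration.

```python
# Hypothetical sketch of centralized composite-event failure detection:
# a user-visible failure is modeled as a set of component failures that
# must all occur within a time window, matched at one central collector.
class CentralizedDetector:
    def __init__(self, patterns, window):
        self.patterns = patterns    # pattern name -> set of component ids
        self.window = window        # time window in seconds
        self.last_seen = {}         # component id -> last failure time

    def report(self, component, t):
        """Record a component failure at time t; return the names of
        composite patterns that are now complete within the window."""
        self.last_seen[component] = t
        detected = []
        for name, components in self.patterns.items():
            times = [self.last_seen.get(c) for c in components]
            if all(x is not None and t - x <= self.window for x in times):
                detected.append(name)
        return detected

d = CentralizedDetector({"frontend-outage": {"lb", "web1", "web2"}},
                        window=60)
d.report("lb", 0)
d.report("web1", 10)
d.report("web2", 30)  # all three within 60s -> pattern detected
```

In a distributed variant, each subtree of the network would run the same
matching logic on its local components and forward only partial matches
upward, trading extra local state for less network traffic and a shorter
TTD near the failure sources.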

Second, I think the Time to Repair (TTR) in this paper should be split
into Time to Diagnose (TTDiag) and Time to Repair (TTR). By analyzing
TTDiag and TTR separately, we have a better chance of understanding how
diagnosis effort and repair priority each contribute to recovery time, and
of reducing their effects. Third, since operator errors are the largest
cause of failures, as the paper reports, it makes sense to divide them
into several categories so that we can look at the different patterns of
failure-inducing operations. As a result, we may gain better knowledge of
how to improve operator tools to avoid misconfigurations.
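The proposed split can be made concrete with per-incident timestamps:
TTDiag runs from detection to identification of the cause, and TTR proper
from identification to restoration of service. The incident records below
are invented purely to illustrate the computation.

```python
# Hypothetical sketch: splitting overall repair time into Time to
# Diagnose (detected -> cause identified) and Time to Repair proper
# (cause identified -> service restored). All numbers are made up.
incidents = [
    # (detected_at, diagnosed_at, repaired_at), in minutes
    (0, 45, 60),
    (0, 10, 100),
    (0, 30, 50),
]

ttdiag = [diag - det for det, diag, _ in incidents]    # [45, 10, 30]
ttrep = [rep - diag for _, diag, rep in incidents]     # [15, 90, 20]

mean_ttdiag = sum(ttdiag) / len(ttdiag)
mean_ttrep = sum(ttrep) / len(ttrep)
```

Even in this toy data, the second incident shows a short diagnosis but a
long repair, a distinction that a single aggregate TTR would hide.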

Overall, this is a well-written paper that provides a complete picture of
failure behavior in Internet services.