CSC 2231 Failures Review

From: Jin Chen <jinchen_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Thu, 29 Sep 2005 00:38:12 -0400

This paper studies three Internet services, identifies the root causes
of their failures, and then analyzes the effectiveness of several
techniques used to avoid or reduce service failures.

The authors point out that operator errors are a major cause of service
failures. Although it is common sense that humans make mistakes easily,
it is still surprising that more than half of these errors are
configuration errors. This result emphasizes the importance of checking
and verifying operators' work. The authors suggest that advanced tools
should be provided to help operators understand component dependencies.
The paper also indicates that failure types are closely related to
application characteristics.

Failure studies are usually hard because they need a sufficient number
of cases. Although the conclusions of this paper seem reasonable, it
studies only three services; each represents a different kind of
application, and for each the number of errors is only in the hundreds.
If the authors had surveyed several services for each type of
application, their analysis would be more convincing.

This paper focuses only on failure types; it ignores failure patterns,
MTBF, yield, and harvest, as well as the relation between failures and
system scale. It fails to address the following questions: first, whether
the failure ratio is linear in the system scale; second, whether the
failures reflect limitations of the system architecture, since the three
services have different architectures; third, whether we can roughly
predict the frequency of service failures.
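
To make the missing metrics concrete, here is a minimal sketch using
their usual definitions: MTBF as operating time divided by the number of
failures, yield as the fraction of offered requests that get answered,
and harvest as the fraction of the full data set reflected in an answer.
The function names and all numbers are hypothetical, chosen only for
illustration; none of them come from the paper.

```python
def mtbf(operating_hours, failure_count):
    """Mean time between failures: operating time per observed failure."""
    if failure_count == 0:
        raise ValueError("MTBF is undefined when no failures were observed")
    return operating_hours / failure_count


def yield_fraction(requests_completed, requests_offered):
    """Yield: fraction of offered requests that were answered at all."""
    return requests_completed / requests_offered


def harvest_fraction(data_available, data_total):
    """Harvest: fraction of the full data set reflected in an answer."""
    return data_available / data_total


# Hypothetical numbers for one month of operation (not from the paper).
hours = 30 * 24
print(mtbf(hours, 4))                       # 180.0 hours between failures
print(yield_fraction(995_000, 1_000_000))   # 0.995
print(harvest_fraction(9, 10))              # 0.9
```

Reporting numbers like these per service, alongside the failure types,
would let a reader compare the three systems directly.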

Some costs of the failure mitigation techniques in Table 7 do not seem
reasonable. Why is the implementation cost of redundancy low? It could be
very expensive to buy and install redundant devices. Why does
configuration have zero performance impact? These costs may vary with
concrete conditions.

In general, this paper illustrates some important aspects of Internet
system failures, though it draws its conclusions from a small number of
cases.