Review - Why do Internet Services Fail?

From: Jesse Pool <jesse.pool_REMOVE_THIS_FROM_EMAIL_FIRST_at_utoronto.ca>
Date: Wed, 28 Sep 2005 18:07:12 -0400

Understanding failures in large-scale Internet services becomes increasingly
important as systems grow larger, more complex, and more pervasive in our
lives. Oppenheimer et al. show that a large fraction of service failures can
be attributed to operational errors. Their conclusions are drawn from
observations of three Internet service companies: an Internet portal, a
content hosting service, and a read-mostly service.

The results of this study show that operator error accounts for a large
percentage of the failures observed by customers of large-scale Internet
services. The authors also note that operator error is a large contributor
to time to repair, and that the most costly operator error is
misconfiguration. The strength of this paper is that its results are not
obvious: the usual focus is on improving systems from a technological point
of view, whereas here the argument is that reliability can be improved by
providing a better operator interface.

Although the study presents a significant amount of interesting and
convincing data, the authors seem to devalue their results in several
instances throughout the paper. For example, their justification for a
'world wide failure repository' is that operators do not correctly fill out
problem reports, which leads to an inconsistent data set. They also confirm
that the time to repair a system is often based on judgement and that
repairs are performed on a priority basis. This introduces complexities
into the data set that cannot be fully observed from the provided graphs.

As a whole, the revolutionary direction of this paper outweighs the lack of
convincing data.