CSC2231 - Failure study review

From: Madalin Mihailescu <madalin_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Thu, 29 Sep 2005 10:51:14 -0400

Why do Internet services fail, and what can be done about it?
-------------------------------------------------------------
David Oppenheimer, Archana Ganapathi, David A. Patterson

The paper addresses two important questions about Internet services: why
they fail and how these failures can be mitigated. The analysis, based
on three real services (Content, Online, and ReadMostly), shows that most
user-visible failures come from operator errors and networking
problems. The paper also argues that, with the cost/performance
trade-off in mind, Internet services should adopt techniques for
mitigating failures (e.g. online testing).

The main strength of the paper is its insightful analysis of failures.
The measurements identify operator errors and networking problems, which
are masked less frequently than other failures, as the main factors in
service failures. Also, in two of the three services the primary location
of failures was the front end, and in the third it was the network (a
single point of failure).
The paper investigates ways to avoid the failures or reduce their impact
through measurements of existing techniques and case studies. It also
gives a number of directions to follow in order to improve availability
in Internet services.

As for weaknesses, I'm not sure about the treatment of security issues.
Attacks can be a real problem in Internet services and an important
source of service failures.

Although it is not central to the paper, I think there is an error in the
problem tracking database example (is it Online or Content?).

Overall, the paper provides a thorough analysis of failures in
Internet services and, in my opinion, it manages to answer both questions
of why and how.
Received on Thu Sep 29 2005 - 10:51:23 EDT
