(no subject)

From: Nilton Bila <nilton_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Thu, 29 Sep 2005 10:24:33 -0400

REVIEW: Why do Internet services fail, and what can be done about it?

Summary

The paper analyses the main causes of service failures in Internet
systems and suggests solutions that could mitigate them. The study
analysed component and service failure data from three major Internet
services, each comprising between 500 and 2000 nodes, over periods of
3 to 7 months. It presents some unexpected findings and puts forth
approaches to minimize failures, although with no empirical evidence of
their efficacy.

Strengths
The paper makes several contributions. In no particular order: first, it
points out that, contrary to common belief, operator error is the most
important cause of service failure, as it is harder to mask, and many
operators are poorly equipped for the job (companies invest little in
tools for operators, as they see operators as able to make administrative
changes themselves if they experience failures) or have inadequate
high-level knowledge of their systems. Secondly, it points out that, also
contrary to common belief, front-end machines are in fact a major source
of service failures (up to 50% in two cases), which is a wake-up call, as
most cluster architectures include fewer front-end machines than
back-ends. Lastly, it proposes a set of techniques which, although
untested, could potentially reduce service failures, among them:
correctness testing (both pre-deployment and in production), redundancy,
fault injection and load testing, configuration testing, component
isolation, proactive restart, and better ways to expose and monitor
errors.

Weaknesses
The study should be repeated with more detailed failure data, such as
the cause of each component failure, so that further conclusions can be
drawn about which classes of components contribute most to errors: for
example, whether most software errors are operating system or
application related, whether custom or off-the-shelf software
contributes the most, or even what types of coding errors are most common.
It should also implement the suggested techniques so that their
effectiveness, performance and implementation costs can be assessed
empirically.
Although it identifies operator error as the major source of service
failures, it offers only one solution (namely configuration checking
tools), which would potentially mitigate only 9 of the 117 failures in
the study.
Lastly, it asserts that x86 machines appeared more reliable than expensive
SPARC ones; however, it fails to assess the types of load each machine
was subjected to, as well as whether all machines had the same lifetime.

Comments
It is refreshing to see that the huge amounts of research and investment
in hardware for availability and failure safety have resulted in a very
small proportion of hardware failures causing service failures
(hardware/service failures: 18/1 in Content, 90/3 in Online), unlike the
case with operator errors (operator/service failures: 36/18 and 32/10 in
Content and Online, respectively). With this we see that the
recognition by Internet services that hardware will always fail, and that
the focus should be on minimizing the effects of such failures, is proving
fruitful.
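
To make the comparison concrete, the figures quoted above can be turned
into rates. The sketch below assumes each pair reads as (component
failures, resulting service failures); that reading, and the names
FAILURE_COUNTS and service_failure_rate, are assumptions of this sketch
and not taken from the paper. Under that reading, roughly 6% and 3% of
hardware failures became service failures, against 50% and about 31% of
operator errors.

```python
# Pairs are (component failures, resulting service failures), as quoted in
# the review; reading them this way is an assumption about the notation.
FAILURE_COUNTS = {
    ("hardware", "Content"): (18, 1),
    ("hardware", "Online"): (90, 3),
    ("operator", "Content"): (36, 18),
    ("operator", "Online"): (32, 10),
}


def service_failure_rate(component_failures, service_failures):
    """Fraction of component failures that surfaced as service failures."""
    return service_failures / component_failures


if __name__ == "__main__":
    for (cause, service), (components, services) in FAILURE_COUNTS.items():
        rate = service_failure_rate(components, services)
        print(f"{cause:8s} {service:8s} {rate:6.1%}")
```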
Received on Thu Sep 29 2005 - 10:24:41 EDT
