Review - Why do Internet Services fail, and what can be done about it?

From: Ian Sin <ian.sinkwokwong_REMOVE_THIS_FROM_EMAIL_FIRST_at_utoronto.ca>
Date: Wed, 28 Sep 2005 17:27:50 -0400

This paper analyzes the causes of user-visible (service-impacting) failures
in large-scale Internet services by mining problem-tracking databases and
individual problem reports. The authors then identify the major causes of
failure and suggest mitigation techniques that would help improve
availability.

The strong point of this report is that it is one of the few recent studies
that analyze failures in large-scale Internet services. It supports its
arguments with numbers and graphs mined from problem reports and presents
them clearly. The authors also include a few good case studies, which I
found rather interesting.

However, the report has several weak points. It is questionable whether the
subset of large-scale Internet services studied is representative of the
industry. How large is "large", and how well does this report reflect the
situation in not-so-large Internet services? I also believe that there are
some problems in the discussion of "service failure cause"; see the Appendix.
The report also uses MTTR (Table 5), but the mean is not a representative
measure of the time it takes to repair a failure. The median may be a better
measure, as we discussed.
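
To illustrate why the median may be the better summary, here is a minimal
sketch with made-up repair times. The numbers are hypothetical and are not
taken from Table 5. A single long outage pulls the mean far above what a
typical repair looks like, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical repair times in hours (NOT from the paper's Table 5):
# most incidents are fixed quickly, one takes far longer.
repair_hours = [1.0, 1.5, 2.0, 2.5, 48.0]

mttr = mean(repair_hours)          # arithmetic mean, i.e. MTTR
median_ttr = median(repair_hours)  # middle value of the sorted times

print(f"mean (MTTR): {mttr:.1f} h")
print(f"median TTR:  {median_ttr:.1f} h")
```

With these assumed values the mean is 11.0 hours and the median is 2.0
hours, even though four of the five repairs took 2.5 hours or less.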

Ideally, companies would carry out all of the recommended mitigations.
However, we need to take into account that there are a lot of external
factors. For instance, management and marketing might want the new system or
service up and running before the technical team has finished all the
testing, or when it has carried out only a subset of the tests. I believe
that this is a realistic scenario, as time is money.

Appendix

In the text above Table 3, I quote, "networking problems were a
significant cause of failure in all three services, and they caused a
surprising 76% of all service failures at ReadMostly". If I am correct, they
got this number by adding up the network-related percentages in Table 3,
i.e., Operator net (14%), H/W net (10%), S/W net (19%) and Unknown net (33%).
This amounts to 76%. However, the entries in this row do not even add up to
100%, making their claimed 76% flawed!
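
The addition can be checked mechanically. This is a minimal sketch that uses
only the four percentages quoted in this appendix; the dictionary keys are
labels chosen here for illustration. The remaining entries of the row are not
reproduced in this review, so the full-row total is not checked.

```python
# Network-related percentages for ReadMostly, as quoted in this review
# from Table 3. Keys are illustrative labels, not the paper's headers.
network_causes = {
    "operator net": 14,
    "hardware net": 10,
    "software net": 19,
    "unknown net": 33,
}

# Sum the four network-related entries.
network_total = sum(network_causes.values())
print(f"network-related total: {network_total}%")
```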

Received on Wed Sep 28 2005 - 17:28:18 EDT
