review: Why do Internet services fail, and what can be done about it?

From: Jing Su <jingsu_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Wed, 28 Sep 2005 13:58:16 -0400

Review:
"Why do Internet services fail, and what can be done about it?"

This paper presents an empirical study of the causes of internet
service failures, particularly those that produce a visible
degradation in client service quality. Long-term failure data from
three large sites were collected and analyzed.

The paper shows that operator error is the largest contributor to
failures in internet systems. Not surprisingly, it also finds that
the tools available to operators for analysis and diagnosis are
severely lacking.

That humans are "the weak link" is well known in the security
community, and many in the I.T. SysAdmin field would agree that it is
a familiar problem there too. The key strength of this paper is that
it provides empirical evidence to support this belief and places it
within the context of other failures and problems.

The paper concludes that significant improvements in the tools
available to system operators are needed. In a way, internet services
are like living software projects. The software engineering community
has long recognized that human errors are the greatest contributor to
failures, and that tools and techniques need to be developed to handle
increasing complexity. Similarly, as internet systems grow larger and
more complex, with many intertwined and interdependent components,
improved tools are necessary to understand the health of the system.

The paper suggests many techniques for testing and verification in
order to prevent failures caused by system changes. However, it does
not consider a more automated approach. Rather than relying on
hand-written fault injection test cases and offline pre-staging
machines, operators could use an oracle system to watch system health.
Such an oracle could characterize "normal" traffic (both external and
internal) and raise flags when it sees patterns that deviate from this
norm.
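
To make the oracle idea concrete, here is a minimal sketch. It is not
taken from the paper. It assumes that "normal" traffic can be summarized
by the mean and standard deviation of a per-metric rate, and that a
deviation of more than three standard deviations is worth flagging. The
class name TrafficOracle, the metric names, and the threshold are all
hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


class TrafficOracle:
    """Hypothetical health oracle: learns a baseline, flags deviations."""

    def __init__(self, threshold=3.0):
        # Number of standard deviations tolerated before flagging
        # (an assumed default, not a value from the paper).
        self.threshold = threshold
        self.baseline = {}  # metric name -> (mean, standard deviation)

    def train(self, metric, samples):
        """Characterize 'normal' behaviour for one metric."""
        if len(samples) < 2:
            raise ValueError("need at least two samples to build a baseline")
        self.baseline[metric] = (mean(samples), pstdev(samples))

    def is_anomalous(self, metric, value):
        """Return True when value deviates from the learned norm."""
        mu, sigma = self.baseline[metric]
        if sigma == 0:
            # Perfectly constant baseline: any change is a deviation.
            return value != mu
        return abs(value - mu) / sigma > self.threshold


if __name__ == "__main__":
    oracle = TrafficOracle()
    # Exterior traffic: client requests per second at the front end.
    oracle.train("external_rps", [100, 102, 98, 101, 99])
    # Interior traffic: queries per second between internal tiers.
    oracle.train("internal_qps", [20, 22, 18, 21, 19])

    print(oracle.is_anomalous("external_rps", 101))  # within the norm
    print(oracle.is_anomalous("internal_qps", 0))    # interior traffic vanished
```

A real oracle would need a richer model of traffic (time of day, request
mix, correlations between tiers), but even this simple form shows how
interior traffic dropping to zero could be flagged without anyone
writing a test case for that specific failure.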