REVIEW - Lessons from Giant-Scale Services

From: Ian Sin <ian.sinkwokwong_REMOVE_THIS_FROM_EMAIL_FIRST_at_utoronto.ca>
Date: Mon, 19 Sep 2005 08:27:34 -0400

SUMMARY

This paper discusses principles for designing and managing high-availability,
giant-scale Internet services. It compares the cluster models used by such
systems, namely replicated and partitioned systems, which sit at the two ends
of the spectrum. This comparison motivates the choice of metrics for measuring
the availability of these systems and for observing how those metrics are
affected by faults. These observations in turn lead to conclusions about what
can be done to minimize the impact of faults through graceful degradation.
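
(For reference, the definitions as I recall them from the paper: yield =
queries completed / queries offered, and harvest = data available / complete
data. Uptime alone hides the fact that losing queries at peak load costs far
more than losing them during a quiet period.)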

STRENGTH

The strength of this paper lies in the DQ principle, which is at its core a
bandwidth constraint, and in the analysis of how faults reduce DQ and therefore
harvest and yield. Several ideas show how DQ can be controlled and tuned to
maintain high availability, depending on the business model. Graceful
degradation is a neat idea that keeps the service available under stress,
either by admitting more queries at the cost of reduced harvest, or by limiting
capacity (yield) in order to maintain full harvest.
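
To make the trade-off concrete, here is a minimal sketch (my own illustration,
not code from the paper) of the two degradation policies when a fraction of
nodes fails; the function and policy names are invented:

    # Minimal illustration: losing nodes_failed of nodes_total forces a choice
    # between keeping harvest (full data, fewer queries served) and keeping
    # yield (all queries served, partial data). Either way, DQ drops by f.
    def degrade(nodes_total, nodes_failed, policy):
        f = nodes_failed / nodes_total
        harvest, yld = 1.0, 1.0
        if policy == "keep_harvest":   # replica-style: full data per query,
            yld = 1.0 - f              # but shed or queue the excess queries
        else:                          # partition-style: answer every query,
            harvest = 1.0 - f          # but with part of the data missing
        return harvest, yld

    # Losing 2 of 10 nodes: (1.0, 0.8) vs. (0.8, 1.0) -- a 20% DQ hit either way.
    print(degrade(10, 2, "keep_harvest"), degrade(10, 2, "keep_yield"))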

WEAKNESS

The main weakness of this paper is the lack of experimental results to support
the author's "experience" with giant-scale systems. The ideas behind graceful
degradation sound good in theory, but how well they perform in practice is
another question. Would it not be less costly to simply add more hardware to
stay available during software updates (flash crowds aside)? For example, for
a rolling update that takes down one machine at a time in a cluster, we might
as well add another machine to the cluster to absorb the excess load. The
paper also fails to weigh its chosen metrics, i.e. uptime, harvest, and yield,
against other available metrics; doing so would make the author's claims
stronger.
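
A quick back-of-the-envelope check (my numbers, not the paper's): taking one
node out of an n-node cluster for an upgrade removes roughly 1/n of the DQ.
For n = 20 that is a 5% capacity hit, which a single spare machine covers at a
known, fixed cost, so the hardware-first alternative deserves at least a cost
comparison in the paper.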

FUTURE WORK

The assumption here is that all queries are small, e.g. HTTP traffic. With
today's peer-to-peer systems we could have long-lived connections at all
times, for example a video stream, and it would be hard to wait for a node to
become idle before unplugging it for updates. Research into migrating live
TCP connections in a rolling-update scenario would be interesting.
