Lessons from Giant-Scale Services Summary

From: Troy Ronda <ronda_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Mon, 19 Sep 2005 10:42:43 -0400

"Lessons from Giant-Scale Services" Summary

Troy Ronda


This paper attempts to create a simple model and usable metrics for
"giant-scale" Internet services. Both the model and the metrics rest on
the assumption that such services require high availability, evolution,
and growth. The focus is on services that are remote from the user and
driven by read queries, and the discussion centers on scalability and
availability. The paper provides background material to support this
discussion, including scalability (in terms of failures and upgrades),
replicated versus partitioned clusters, load management with layer-4 and
layer-7 switches, and definitions of online evolution and growth.

The author defines several availability metrics: uptime (a function of
failure repair time and failure occurrence), yield (queries
completed/queries offered), and harvest (data available/complete data).
The author argues that yield is more meaningful than uptime because of
the difference between peak and minimum-load periods: a second of
downtime at peak costs far more queries than a second off-peak. We can
design a system so that faults impact yield, harvest, or both. The
author also defines the DQ metric (data per query X queries per second
--> constant): "the total amount of data that has to be moved per second
on average."

The author proposes that we analyze failures in terms of DQ capacity
reduction. A replicated system maintains "D" and loses "Q" during
failures, while a partitioned system maintains "Q" and loses "D". Above
some specified throughput, we should always use replicas. To achieve
graceful degradation we must reduce yield (reduce Q) through admission
control, reduce harvest (reduce D) through database reduction, or some
combination of the two.

The author also argues that cluster upgrades should be analyzed in terms
of DQ. We can do a fast reboot (large DQ reduction for the shortest
time), a rolling upgrade (minimum DQ reduction for the longest time), or
a big flip (medium DQ reduction for a medium time). In practice, rolling
upgrades are the most popular. The paper concludes with best practices
for developing "giant-scale" services: "Get the basics right", "Decide
on your availability metrics", focus on reducing failure repair time at
least as much as on reducing failure occurrences, "Understand load
redirection during faults", "Graceful degradation is a critical part of
high-availability strategy", "Use DQ analysis on all upgrades", and
"Automate upgrades as much as possible".
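The metrics above are simple enough to sketch in code. The following is a
minimal illustration in Python; the function names and signatures are my
own, not from the paper, and the cluster model (n equal nodes, each
holding either the full dataset or 1/n of it) is an assumption.

```python
# Illustrative sketch of the paper's availability metrics.
# All names and the node model are my own, not the paper's.

def uptime(mtbf: float, mttr: float) -> float:
    """Fraction of time up: (MTBF - MTTR) / MTBF."""
    return (mtbf - mttr) / mtbf

def yield_(completed: int, offered: int) -> float:
    """Queries completed / queries offered.
    (Trailing underscore only because `yield` is a Python keyword.)"""
    return completed / offered

def harvest(available: float, complete: float) -> float:
    """Data available / complete data."""
    return available / complete

def dq_after_failure(n: int, k: int, replicated: bool) -> tuple:
    """Fraction of (D, Q) remaining after k of n nodes fail.

    Replicated cluster: every node holds the full dataset, so D stays
    1.0 and Q drops to (n - k) / n.
    Partitioned cluster: each node holds 1/n of the data, so D drops to
    (n - k) / n while Q for the surviving data stays 1.0.
    Either way the DQ product falls by the same factor.
    """
    remaining = (n - k) / n
    return (1.0, remaining) if replicated else (remaining, 1.0)
```

For example, losing 2 of 4 nodes leaves the DQ product at 50% under
either design; the design choice only determines whether the loss shows
up as reduced yield (replication) or reduced harvest (partitioning).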
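The upgrade comparison can also be put in rough numbers. This is a
back-of-the-envelope sketch under my own assumptions (full-cluster DQ
normalized to 1.0 per unit time, fast reboot upgrading all nodes in
parallel, n even for the big flip); the function names are illustrative,
not from the paper.

```python
# DQ capacity-time lost under the three upgrade strategies.
# Normalization and names are mine, not the paper's.

def dq_loss(nodes_down: int, n: int, duration: float) -> float:
    """DQ capacity-time lost while nodes_down of n nodes are offline,
    with full-cluster DQ normalized to 1.0 per unit time."""
    return (nodes_down / n) * duration

def fast_reboot(n: int, per_node_time: float) -> float:
    # Whole cluster down at once for one upgrade interval
    # (assumes all nodes upgrade in parallel).
    return dq_loss(n, n, per_node_time)

def rolling_upgrade(n: int, per_node_time: float) -> float:
    # One node down at a time: a small DQ dip, repeated n times.
    return n * dq_loss(1, n, per_node_time)

def big_flip(n: int, per_node_time: float) -> float:
    # Half the cluster down, then the other half (n assumed even).
    return 2 * dq_loss(n // 2, n, per_node_time)
```

For n = 10 nodes and a 60-second per-node upgrade, all three strategies
lose the same total DQ (60 capacity-seconds); they differ only in
whether the loss is concentrated (fast reboot) or spread thin (rolling
upgrade), which is why DQ, rather than wall-clock downtime, is the
useful lens.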


The main strength of this paper is that it turns what many systems
administrators might consider rules of thumb into measurable metrics.
Additionally, the background discussion is helpful for people new to the
field, in particular the discussion of load management and the models of
Internet services. The observation that clusters are the only realistic
choice for giant-scale services is valuable, as is the point that simple
load redirection during faults is probably unrealistic, especially with
random hashing of the database across nodes. In the end, thinking in
terms of metrics will help both researchers and industry: researchers
can model failures and have an objective measure for new solutions, and
industry can examine in a concrete way how failures will affect them,
and hence prepare for them.


The main drawback of this paper is that I never had my "eureka" moment -
most of the material did not seem completely novel to me. In the paper's
defense, I was warned about this at the beginning (and at the end). I
would have liked to hear more about how these metrics have actually
improved "giant-scale" clusters. The author gives motivations, but it
would still be good to see data - a graph, for instance - showing that
these metrics are representative of the failures discussed in the paper.
I do like the metrics and they seem useful, but I would still want some
hard numbers. Beyond these issues, it is hard to pick out problems,
since this is an experience paper and I have never worked on a
"giant-scale" system.
