Review: Lessons from Giant-Scale Services

From: Jing Su <jingsu_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Mon, 19 Sep 2005 09:52:35 -0400

Lessons from Giant-Scale Services,
Eric Brewer,
IEEE Internet Computing, 2001.

SUMMARY

This article is a "practical lessons learned" paper, providing
insights into the topologies and metrics used for running and
maintaining large-scale Internet services. The perspective of the
paper is focused on workloads of large numbers of (relatively) small
queries, most of which are independent of one another, for which a
cluster computing platform is most appropriate. The paper's insights
are less useful for vector-supercomputer scientific platforms.

The author presents the "DQ principle", stated as:
  D (data per query) x Q (queries per second) ~ constant

The intuition behind this rule of thumb is that physical
architectural limitations, such as I/O bandwidth, bind these two
factors together. The principle is then used to analyze the effects
of system failures, overload, and upgrade management. For example,
under high demand a service can reduce D, searching a smaller
fraction of the database or returning fewer fields per query, in
order to sustain Q. The constant in the DQ equation can also be
adjusted and re-evaluated in the face of node failures, upgrades, or
selective partitioning.
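
To make the principle concrete, here is a minimal sketch (my own
illustration, not from the paper; the node counts and capacities are
invented) of how a node failure's cost can be taken out of either
harvest (D) or yield (Q):

    # DQ capacity under node failures (illustrative sketch, made-up numbers).
    def remaining_fraction(nodes, failed):
        """Fraction of total DQ capacity left after some nodes fail."""
        return (nodes - failed) / nodes

    frac = remaining_fraction(nodes=10, failed=2)   # 0.8

    # Replicated data: every query still sees the full data set (full
    # harvest), but only 80% of offered queries are served (yield drops).
    print("yield if harvest is preserved:", frac)

    # Partitioned data: every query is answered (full yield), but each
    # answer covers only 80% of the data (harvest drops).
    print("harvest if yield is preserved:", frac)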

KEY STRENGTHS

The key strength of this paper is the author's use of the DQ
principle for assessing and planning failure handling and maintenance
work on cluster systems. Using this principle, the author shows how a
system can be tailored to fit an organization's computing needs.
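
As a back-of-the-envelope example of that kind of planning (my own
sketch with invented numbers, following the paper's comparison of
upgrade strategies), the DQ principle lets an operator compare how
different upgrade schemes spend the same capacity-time budget:

    # Upgrade planning with DQ (illustrative sketch, invented numbers).
    nodes = 10    # cluster size
    u = 5.0       # minutes to upgrade one node

    # Rolling upgrade: 1/n of capacity offline for n*u minutes.
    rolling = (1 / nodes) * (nodes * u)

    # "Big flip": half the cluster offline, twice, for u minutes each.
    big_flip = 0.5 * (2 * u)

    # Fast reboot: the whole cluster offline for u minutes.
    fast_reboot = 1.0 * u

    # All three lose the same total DQ (capacity x time); they differ
    # only in how the loss is spread over time.
    print(rolling, big_flip, fast_reboot)   # 5.0 5.0 5.0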

KEY WEAKNESSES

For a paper written in 2001, the article spends too much time
explaining the expensive switches used to aid load balancing and
fault-tolerant switch-over. If such hardware switches already exist
as commercial products, the technique is hardly novel, and the switch
manufacturers obviously understood the need.

While the DQ principle and the various practical experiences are
valuable, the author does not present any significant vision for what
tomorrow's tools for managing large-scale services might look like.
Many of the principles presented here are modest refinements of ideas
well known to capable network administrators and engineers. In this
context, I was disappointed by the lack of vision in the conclusions
and future work, especially coming from an academic researcher and
founder of a technology company.

OTHER COMMENTS

The biggest take-away point of this article is never explicitly
stated. In a subtle way, the article highlights that a company's real
IT costs are technical staff and lost business opportunity (due to
downtime or failure). The original motivation for running clusters
was to leverage commodity hardware to create large-scale
computational power. As it turns out, the real benefit of cluster
systems is improved availability, long-term growth, and
maintainability. I believe there is no reason for companies to
purchase "bottom-of-the-barrel" systems for cluster nodes; on the
contrary, they will likely choose top-of-the-line systems loaded with
quality components. This is analogous to RAID research, which
originally aimed to enable the use of Inexpensive Disks but resulted
in large installations using very high-quality disks to provide
increased performance and integrity.