Review - Web Caching and Zipf-like Distributions: Evidence and Implications

From: Ian Sin <ian.sinkwokwong_REMOVE_THIS_FROM_EMAIL_FIRST_at_utoronto.ca>
Date: Wed, 12 Oct 2005 17:24:25 -0400

This paper presents evidence that web request workloads follow a Zipf-like
distribution, addressing conflicting conclusions drawn by earlier research.
The authors then present a model for web requests which they show closely
matches the Zipf-like behavior observed in real traces, albeit one that
ignores some properties of web workloads such as document modifications. The
study also examines the hit-ratio and temporal-locality properties of web
request traces. It shows that web workloads exhibit little correlation
between access frequency and document size or modification rate, and
suggests that temporal locality may not be important for the web.
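
To make the paper's central claim concrete, here is a minimal sketch (not
the authors' methodology) that generates a synthetic Zipf-like request
stream and measures LRU hit ratio at several cache sizes. The exponent
alpha=0.8 and the population/trace sizes are assumed values for
illustration, in the general range the paper discusses; the slow growth of
hit ratio with cache size is the behavior the 90/10 discussion refers to.

```python
import random
from collections import OrderedDict

def zipf_weights(n_docs, alpha=0.8):
    # Zipf-like popularity: P(rank i) proportional to 1 / i^alpha
    # (alpha is an assumed illustrative value, not from the paper's traces)
    w = [1.0 / (i ** alpha) for i in range(1, n_docs + 1)]
    total = sum(w)
    return [x / total for x in w]

def simulate_lru(trace, cache_size):
    # OrderedDict as an LRU: most recently used at the end
    cache = OrderedDict()
    hits = 0
    for doc in trace:
        if doc in cache:
            hits += 1
            cache.move_to_end(doc)
        else:
            cache[doc] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)

random.seed(0)
n_docs = 10_000
trace = random.choices(range(n_docs), weights=zipf_weights(n_docs), k=100_000)

for size in (100, 1_000, 5_000):
    print(size, round(simulate_lru(trace, size), 3))
```

Under a Zipf-like workload the hit ratio keeps climbing slowly as the cache
grows, rather than saturating once a small "hot" working set fits, which is
why the 90/10 rule of processor caches does not carry over.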

The strong point of the paper is that, unlike earlier studies that drew
conflicting conclusions, this study uses large real-life traces from
different populations, such as ISPs, universities, and corporations. This
gives a better representation of the diversity of the Internet population.
It also confirms that web workloads do not follow the 90/10 rule, which
suggests that web caches should probably be designed differently from
processor caches.

I believe one area for improvement is the section on "Page request
interarrival-times". From reading the paper, it is still unclear to me how
studying this property helps the authors or what conclusions they draw from
the observation.

This study was important in 1999, at a time when Internet traffic was
almost exclusively web workloads; the traces they study were a year or two
older still. It is useful because it informs the design of better web
caches, from cache sizing to replacement algorithms. However, we have to
ask whether the results of this study still hold today. In the fall of
1999, Napster started the idea of peer-to-peer file sharing, and since then
these multimedia workloads (much larger than the average 15KB documents
they study) have come to dominate Internet traffic [1]. I believe we could
explore (if it has not been done already) content-based cooperative
caching, where different caches (or clusters) at different levels of the
hierarchy store different types of content. A cache could query nearby
caches before going up the hierarchy if that is faster.
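
As a hypothetical sketch of this cooperative lookup idea (my own proposal,
not anything from the paper), each cache could be specialized to a set of
content types and try local store, then nearby peers, then its parent. The
class, method names, and the `fetch_from_origin` stub below are all invented
for illustration.

```python
def fetch_from_origin(key):
    # Stand-in for fetching a document from the origin server
    return f"doc:{key}"

class CooperativeCache:
    def __init__(self, content_types, peers=None, parent=None):
        self.content_types = set(content_types)  # types this cache stores
        self.store = {}
        self.peers = peers or []     # nearby caches, assumed cheap to query
        self.parent = parent         # next level up the hierarchy

    def lookup(self, key, content_type):
        # 1. Local hit
        if key in self.store:
            return self.store[key], "local"
        # 2. Ask nearby peers that specialize in this content type
        for peer in self.peers:
            if content_type in peer.content_types and key in peer.store:
                return peer.store[key], "peer"
        # 3. Fall through to the parent, or the origin at the top level
        if self.parent is not None:
            value, _ = self.parent.lookup(key, content_type)
        else:
            value = fetch_from_origin(key)
        # Only cache content types this node is responsible for
        if content_type in self.content_types:
            self.store[key] = value
        return value, "miss"

# Usage: a parent caching everything, two specialized edge caches
parent = CooperativeCache({"html", "video"})
edge_html = CooperativeCache({"html"}, parent=parent)
edge_video = CooperativeCache({"video"}, peers=[edge_html], parent=parent)

print(edge_html.lookup("index", "html"))   # miss, filled from origin
print(edge_html.lookup("index", "html"))   # now a local hit
print(edge_video.lookup("index", "html"))  # served by the nearby peer
```

The point of the sketch is only the lookup order: peer queries happen
before climbing the hierarchy, on the assumption that nearby caches are
faster to reach than the parent.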

References

[1] "Measurement, Modeling and Analysis of a Peer-to-Peer
File-Sharing Workload", by Krishna P. Gummadi, Richard J. Dunn, Stefan
Saroiu, Steven D. Gribble, Henry M. Levy, and John Zahorjan. Proceedings of
the 19th ACM Symposium on Operating Systems Principles (SOSP), Bolton
Landing, NY, October 2003.
