REVIEW: Web Caching and Zipf-like Distributions: Evidence and Implications

From: Nilton Bila <nilton_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Thu, 13 Oct 2005 10:24:09 -0400

The paper studies the applicability of the Zipf distribution to web
requests at proxies and whether there are correlations between the
popularity of pages and their sizes or rate of modification. Based on
analysis of six diverse traces totalling over 17 million requests, it
concludes that the distribution of page requests from a fixed group of
users does not follow a Zipf distribution, as claimed in previous papers,
and that instead its distribution is Zipf-like. Under a Zipf distribution,
the probability of a request for the ith most popular page is Ω/i.
Evidence from the paper indicates that it is instead Ω/i^α, where
0 < α < 1 and the value of α varies from trace to trace, according to the
trace's homogeneity. The paper also points out that there
is little correlation between the popularity of pages and their sizes or
modification rates, in contrast to claims in existing literature.
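
To make the Zipf-like form concrete, here is a minimal sketch of how the
probabilities Ω/i^α could be computed for a finite set of pages. The
function name, the page count of 1000 and the value α = 0.8 are
illustrative assumptions, not figures from the paper; Ω is simply chosen
so that the probabilities sum to 1.

```python
def zipf_like_probs(n_pages, alpha):
    """Return P(i) = omega / i**alpha for i = 1..n_pages.

    omega is the normalising constant that makes the probabilities sum to 1.
    alpha = 1 gives a strict Zipf distribution; 0 < alpha < 1 gives the
    flatter "Zipf-like" form discussed in the review.
    """
    weights = [1.0 / (i ** alpha) for i in range(1, n_pages + 1)]
    omega = 1.0 / sum(weights)
    return [omega * w for w in weights]


# Hypothetical example values, chosen only for illustration.
probs = zipf_like_probs(1000, 0.8)

# Share of all requests that would go to the ten most popular pages.
top_ten_share = sum(probs[:10])
```

A smaller α flattens the curve, so less of the request volume is
concentrated on the most popular pages.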

The paper is of great qualitative value because, prior to its publication,
there was no agreement on these issues: the evidence on whether web
requests follow a Zipf distribution was inconclusive, with results
supporting both sides of the argument. The paper also analyzes the cache
hit ratio and concludes that it grows logarithmically as a function of the
cache size, of the client population of the proxy, and of the number of
requests seen by the proxy. The weak correlation between page popularity
and page size or modification rate is well evidenced by the graphs in
figures 3 and 4. The paper also debunks the idea that the 10/90 rule
applies to web accesses, and it points out that in the traces no single
web server accounts for the majority of popular pages. Caching strategies
are also discussed, based on the asymptotic properties of the requests
observed.
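
The logarithmic growth of the hit ratio with cache size can be illustrated
with a small sketch. It rests on simplifying assumptions that are mine, not
necessarily the paper's exact model: requests are independent, the cache
always holds exactly the most popular pages, and α = 1. The function name,
the page count and the cache sizes are hypothetical.

```python
def ideal_hit_ratio(cache_size, n_pages, alpha):
    """Hit ratio of a cache holding the cache_size most popular pages.

    Assumes independent requests with P(i) proportional to 1 / i**alpha,
    so the hit ratio is the total probability mass of the cached pages.
    """
    weights = [1.0 / (i ** alpha) for i in range(1, n_pages + 1)]
    return sum(weights[:cache_size]) / sum(weights)


# Hypothetical universe of pages and a sequence of doubling cache sizes.
n_pages = 100_000
sizes = [100, 200, 400, 800, 1600]
ratios = [ideal_hit_ratio(c, n_pages, 1.0) for c in sizes]

# Gain in hit ratio from each doubling of the cache size.
gains = [b - a for a, b in zip(ratios, ratios[1:])]
```

Under these assumptions each doubling of the cache adds a roughly constant
amount to the hit ratio, which is the signature of logarithmic growth.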

The paper could, however, benefit from some consideration of whether the
fact that web documents are composed of multiple components, including the
HTML page, images, etc., has an effect on the popularity distribution, and
of whether a page and its image should be classified as one or two
documents. The graphs in figures 1, 4 and 5 are drawn with heavy lines,
which may to some extent mask the difference between the observed data and
the fitted curves, which in turn would bias the results towards the
paper's arguments.

A blatant example of the inconclusiveness of results from prior research
can be seen by looking at two conflicting papers by Almeida et al. [1], [2],
as well as that of Cunha et al. [5], which share a number of co-authors.
Received on Thu Oct 13 2005 - 10:24:21 EDT
