RE: Zipf Review from Troy Ronda on 2005-10-13 (mbox)

From: Troy Ronda <ronda_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Thu, 13 Oct 2005 11:02:24 -0400

Thanks. I was trying to use gmail to send it but copying and pasting
obviously didn't work very well.

-----Original Message-----
From: Stefan Saroiu [mailto:stefan]
Sent: October 13, 2005 11:01 AM
To: Troy Ronda
Subject: Re: Zipf Review

Wrong line and index code. Try again.

--Stefan

On Thu, 13 Oct 2005, Troy Ronda wrote:

Web Caching and Zipf-like Distributions: Evidence and Implications
Review By: Troy Ronda

Web proxy caching is an important technique for reducing traffic on
the Internet. Creating a statistical model of web traffic will help in
the design and selection of caching techniques. The relative frequency
of web requests have been shown to follow a form of Zipf's law, called
Zipf-like distribution. A distribution is Zipf-like when the relative
probability of a request for the i'th most popular page is
proportional to 1/i^(alpha). Zipf's law is the distribution when
(alpha)=1. Six traces were examined by the authors of size 1,766,409
to size 4,815,551 requests. They found: The distribution of web
requests from a fixed group of users is Zipf-like (contrary to the
Almeida results who found that web proxies do not follow a Zipf
distribution because the users are fixed). It takes 25%-40% of
documents to draw 70% of web accesses.
There is low correlation between frequency of access and size. There
is low correlation between frequency of access and average
modifications per request is also low. From these empirical
observations, the authors create a simple zipf-like distribution
model. This model has a hit-ratio that grows logarithmically against
number of requests with an infinite cache size. When the cache is
limited to holding the C most popular documents, the hit-ratio will
grow logarithmically against cache size. This model was also observed
in operating systems research as "independent reference model." That
research showed that LFU is the best cache replacement policy for this
model. The perfect-LFU strategy has the best byte hit-ratio in most
cases for the six traces.

The observation of zipf-like distributions for accesses is well
supported by previous studies. The three validations for the model is
also well supported by previous studies. I found it very interesting
that many studies can show that accesses follow a distribution so
closely. I was not surprised by the weak correlation between document
size and number of accesses. Most users will not know the size before
requesting the page, hence low correlation. I was also not surprised
by the weak correlation between rate of change and popularity. At a
static point in time, users will not know if the page has been updated
- this impacts the future popularity, not current. I found the charts
to be informative using the log-log scale because it is easier to
follow a straight line. The connection to operating system research
validated the usage of perfect-LFU to me. It also makes sense that
In-Cache LFU is a poor choice. The paper was a nice tour from previous
results to empirical measurement to "theory" to real world.

The traces (and conclusions) shown in this paper are in direct
contrast to the two Almeida results. It would be appropriate to have a
longer explanation than "confusion" of Zipf's law and "old" traces.
Perhaps the authors could have re-examined the Almeida traces. The
authors might not know if the users push the stop button during a
request (coming from the low correlation between size and accesses).
If this is the case, does the byte hit-ratio change? Would this impact
the model? It also seems that change rate might impact the length of
time that a document stays popular, which may be an important factor
for a model. I think this change rate measurement needed more
explanation. First, do you say that dynamic pages are modified each
access? Second, some dynamic pages have unique URLs for each request.
Third, many elements of a page do not change but as a whole the
document is different. I believe the trace periods may be too short as
evidenced by my above remarks; a longer period may have gained
additional insight. It would also be nice to have some more discussion
on a practical "perfect-LRU" if this is the best replacement policy.
Received on Thu Oct 13 2005 - 11:03:20 EDT

This archive was generated by hypermail 2.2.0 : Thu Oct 13 2005 - 11:03:22 EDT