Web Caching and Zipf-like Distributions: Evidence and Implications
------------------------------------------------------------------
Lee Breslau, Pei Cao, Li Fan, Graham Phillips, Scott Shenker

The paper tries to validate the research that had been conducted in the area of web caching prior to 1999 (8). First, by analyzing various traces, the authors show that web requests follow a Zipf-like distribution. They then propose a simple model of requests to examine how the hit ratio depends on cache size, number of requests, and temporal locality.

The paper's strength comes from its thorough analysis of web caching: it uses traces that are dispersed over two years and gathered from different communities, and it defines a model in order to understand caching. To describe the distribution of web requests, the authors extend the concept of a Zipf distribution to a Zipf-like one.

As for weaknesses, I am not convinced by two of their conclusions, regarding the correlations between access frequency and document size, and between access frequency and rate of change. For the first, consider multimedia traffic. A presentation by Krishna Gummadi on P2P networks addressed this issue; its conclusion was that caching large multimedia files pays off considerably, and the hit ratio achieved was very high (in total bytes). A simple example is a popular movie release that remains "hot" for a period of time. Web caches should take this into account rather than ignore it, as the paper suggests. Another conclusion of that work was the non-Zipf behavior of P2P workloads; however, since this paper is dated 1999 and the P2P boom came after the year 2000, it is understandable that this kind of traffic does not appear in their traces.

For the second correlation, consider an "online live event" website, for example www.livescore.com or www.eurosport.com. What these kinds of sites usually do is present a live event minute by minute (e.g. a Formula 1 race).
What usually happens is that these sites become very popular for the duration of the event. So isn't this popularity correlated with their rate of change?

Another weakness is the simplicity of the model (independent requests, no document modifications). However, the authors state these limitations and acknowledge that an improved model is needed. Overall, I think the paper gives a good analysis of web traffic from that time (1996 - 1998).
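As an illustration of the two ideas the review keeps returning to, the sketch below simulates the independent-request model with a Zipf-like popularity distribution (the i-th most popular document requested with probability proportional to 1/i^alpha) and measures the hit ratio of an LRU cache as its size grows. This is my own toy reconstruction, not the authors' code; the document count, request count, and alpha = 0.8 are arbitrary assumptions chosen only to make the trend visible.

```python
import bisect
import itertools
import random
from collections import OrderedDict

def hit_ratio(n_docs, n_requests, cache_size, alpha, seed=0):
    """Hit ratio of an LRU cache under the independent-request model.

    Requests are drawn i.i.d. from a Zipf-like distribution in which the
    i-th most popular of n_docs documents has probability ~ 1 / i**alpha.
    """
    rng = random.Random(seed)
    # Cumulative weights for the Zipf-like popularity ranking.
    weights = [1.0 / (i ** alpha) for i in range(1, n_docs + 1)]
    cum = list(itertools.accumulate(weights))
    total = cum[-1]

    cache = OrderedDict()  # key order tracks recency of use
    hits = 0
    for _ in range(n_requests):
        # Inverse-CDF sampling: find the document whose cumulative-weight
        # bracket contains a uniform random draw in [0, total).
        doc = bisect.bisect_left(cum, rng.random() * total)
        if doc in cache:
            hits += 1
            cache.move_to_end(doc)         # mark as most recently used
        else:
            cache[doc] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / n_requests

# Hit ratio grows only slowly with cache size, consistent with the
# logarithmic dependence the paper's model predicts for alpha < 1.
for size in (10, 100, 1000):
    print(size, round(hit_ratio(10000, 20000, size, 0.8), 3))
```

Because requests are independent and identically distributed, the model has no temporal locality beyond what the skewed popularity itself induces; this is exactly the simplification (no request correlations, no document modifications) that the review flags as a limitation the authors themselves acknowledge.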