Web Caching and Zipf-like Distributions: Evidence and Implications
------------------------------------------------------------------
Lee Breslau, Pei Cao, Li Fan, Graham Phillips, Scott Shenker

The paper tries to validate the research that had been conducted in the area of web caching prior to 1999 (8). First, by analyzing various traces, the authors show that web requests follow a Zipf-like distribution. They then propose a simple model of requests to examine how the hit ratio depends on cache size, number of requests, and temporal locality.

The paper's strength comes from its thorough analysis of web caching: it uses traces that are dispersed over two years and gathered from different communities, and it defines a model in order to understand caching. To describe the distribution of web requests, the authors extend the concept of a Zipf distribution to a Zipf-like one.

As for weaknesses, I am not convinced by two of their conclusions, regarding the correlations between access frequency and document size, and between access frequency and rate of change. For the first, consider multimedia traffic. A presentation by Krishna Gummadi on P2P networks addressed this issue; its conclusion was that caching large multimedia files pays off considerably, and the hit ratio achieved was very high (in total bytes). A simple example is a popular movie release that remains "hot" for a period of time. Web caches should take this into account rather than ignore it, as the paper suggests. Another conclusion of that work was the non-Zipf behavior of P2P workloads; however, since this paper is dated 1999 and the P2P boom came after the year 2000, it is understandable that this kind of traffic does not appear in their traces.

For the second correlation, consider an "online live event" website, for example www.livescore.com or www.eurosport.com. What these kinds of sites usually do is present a live event minute by minute (e.g. a Formula 1 race).
What usually happens is that these sites become very popular for the duration of the event. So isn't this popularity correlated with their rate of change?

Another weakness is the simplicity of the model (independent requests, no document modifications). However, the authors state these limitations and acknowledge that an improved model is needed. Overall, I think the paper gives a good analysis of web traffic from that time (1996 - 1998).
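As an illustration of the two ideas the review keeps returning to, the sketch below simulates the independent-request model with a Zipf-like popularity distribution (the i-th most popular document requested with probability proportional to 1/i^alpha) and measures the hit ratio of an LRU cache as its size grows. This is my own toy reconstruction, not the authors' code; the document count, request count, and alpha = 0.8 are arbitrary assumptions chosen only to make the trend visible.

```python
import bisect
import itertools
import random
from collections import OrderedDict

def hit_ratio(n_docs, n_requests, cache_size, alpha, seed=0):
    """Hit ratio of an LRU cache under the independent-request model.

    Requests are drawn i.i.d. from a Zipf-like distribution in which the
    i-th most popular of n_docs documents has probability ~ 1 / i**alpha.
    """
    rng = random.Random(seed)
    # Cumulative weights for the Zipf-like popularity ranking.
    weights = [1.0 / (i ** alpha) for i in range(1, n_docs + 1)]
    cum = list(itertools.accumulate(weights))
    total = cum[-1]

    cache = OrderedDict()  # key order tracks recency of use
    hits = 0
    for _ in range(n_requests):
        # Inverse-CDF sampling: find the document whose cumulative-weight
        # bracket contains a uniform random draw in [0, total).
        doc = bisect.bisect_left(cum, rng.random() * total)
        if doc in cache:
            hits += 1
            cache.move_to_end(doc)         # mark as most recently used
        else:
            cache[doc] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / n_requests

# Hit ratio grows only slowly with cache size, consistent with the
# logarithmic dependence the paper's model predicts for alpha < 1.
for size in (10, 100, 1000):
    print(size, round(hit_ratio(10000, 20000, size, 0.8), 3))
```

Because requests are independent and identically distributed, the model has no temporal locality beyond what the skewed popularity itself induces; this is exactly the simplification (no request correlations, no document modifications) that the review flags as a limitation the authors themselves acknowledge.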