CSC 2209 - Computer Networks - Course Project Web Page

Cooperative Browser Download Streams

Contents (reverse chronological order)

Project Final Write-up (December 13, 2006)
Project Final Presentation (December 8, 2006)
CBDS Firefox Extension (December 6, 2006)
Simulator (December 6, 2006)
5 Minute Project Presentation (November 14, 2006)
Midterm Progress Report (November 10, 2006)
Source and Binaries for Download (November 10, 2006)
Project Proposal (October 10, 2006)

NOTES:

(Nov 10, 2006) This project was originally called "Cooperative Browser Caches". That name was a misnomer: the goal is to share only the content that the user is currently downloading, while it is being downloaded, and not to share the user's cache in general. The project has accordingly been renamed to "Cooperative Browser Download Streams". The project proposal still refers to the original title, but all other content reflects the new name.

Project Final Write-up (December 13, 2006)

The final write-up for the project is here: final_writeup.pdf

Project Final Presentation (December 8, 2006)

The final presentation for the project is here: 2209_final_pres.pdf

CBDS Firefox Extension (December 6, 2006)

The Firefox extension containing the Cooperative Browser Download Streams (CBDS) implementation is here: cbds.xpi

The extension is just a wrapper that bootstraps a piece of Java code that acts as a proxy for Firefox. The extension part of the implementation is based on the Java Firefox Extension.

Simulator (December 6, 2006)

In order to demonstrate that the Cooperative Browser Download Streams (CBDS) solution truly saves server bandwidth, a number of simulations were run. These used a trace containing 1,640,179 Kazaa requests made by over 24,000 clients during a period of 200 days. Before running the simulation, the trace was pruned to contain only requests for files larger than 100 MB, since the CBDS solution is intended for sharing large files.

The trace file consists of a number of entries, one per line, where each line represents the start of a download event and has the following format:

      <timestamp> <client_id> <null> <file_id> <file_size>

      where

      <timestamp> is the start time of the file download, in seconds.

      <client_id> uniquely identifies the client.

      <null> is a dummy entry.

      <file_id>, together with <file_size>, uniquely identifies a file.

      <file_size> is the size of the file being downloaded, in bytes.
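As an illustration, one line of this format could be parsed as follows. The class and field names are my own for the sketch, not taken from Simulator.java.

```java
// Sketch of parsing one trace entry of the form:
//   <timestamp> <client_id> <null> <file_id> <file_size>
public class TraceEntry {
    final long timestamp;   // download start time, in seconds
    final String clientId;  // unique client identifier
    final String fileId;    // identifies a file together with fileSize
    final long fileSize;    // size of the file, in bytes

    TraceEntry(long timestamp, String clientId, String fileId, long fileSize) {
        this.timestamp = timestamp;
        this.clientId = clientId;
        this.fileId = fileId;
        this.fileSize = fileSize;
    }

    static TraceEntry parse(String line) {
        String[] f = line.trim().split("\\s+");
        // f[2] is the dummy <null> column and is skipped
        return new TraceEntry(Long.parseLong(f[0]), f[1], f[3], Long.parseLong(f[4]));
    }
}
```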
    
The Simulator works as follows. It maintains a priority queue of events to be processed, ordered by event time. The queue is populated from the trace file: the main loop reads an entry from the trace file, notes the time associated with that entry, and processes all events in the queue up to that time. It then adds the entry to the queue as a START_DOWNLOAD event (see below for details), reads the next entry from the file, and continues processing events in the queue up to the time of the latest entry read.
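This main loop can be sketched as a small, self-contained program. The trace is passed in as (startTime, downloadId) pairs instead of being read from a file, and "processing" an event just records its label; all names here are illustrative assumptions, not taken from Simulator.java.

```java
import java.util.PriorityQueue;

public class MainLoopSketch {
    record Event(long time, String label) {}

    static String run(long[][] trace, long downloadDuration) {
        PriorityQueue<Event> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.time, b.time));
        StringBuilder order = new StringBuilder();

        for (long[] entry : trace) {
            // Process every queued event up to the time of the entry just read.
            drainUpTo(queue, entry[0], downloadDuration, order);
            queue.add(new Event(entry[0], "START" + entry[1]));
        }
        // Trace exhausted: process whatever events remain.
        drainUpTo(queue, Long.MAX_VALUE, downloadDuration, order);
        return order.toString().trim();
    }

    static void drainUpTo(PriorityQueue<Event> queue, long limit,
                          long duration, StringBuilder order) {
        while (!queue.isEmpty() && queue.peek().time <= limit) {
            Event e = queue.poll();
            order.append(e.label).append(' ');
            // Processing a START event schedules its matching END event.
            if (e.label.startsWith("START")) {
                queue.add(new Event(e.time + duration, "END" + e.label.substring(5)));
            }
        }
    }

    public static void main(String[] args) {
        long[][] trace = {{5, 0}, {20, 1}, {40, 2}};
        System.out.println(run(trace, 10));  // prints "START0 END0 START1 END1 START2 END2"
    }
}
```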

Besides the event queue, the Simulator also maintains, for each file, a mapping to the clients actively downloading it.

There are two types of events that the event queue stores: START_DOWNLOAD events, and END_DOWNLOAD events. Whenever a START_DOWNLOAD event occurs, the simulator checks to see if there's anyone downloading the file. If there is, then one of the clients downloading the file is picked at random to be the peer from which the file is downloaded. If there is no such peer, then the START_DOWNLOAD event results in the current client going to the origin server to download the file. In either case, an entry is added to the mapping of clients actively downloading the file.

END_DOWNLOAD events are inserted into the queue when START_DOWNLOAD events are processed. The timestamp of the END_DOWNLOAD event depends on where the download is being served from. If the client is downloading the file from the origin server, the event's timestamp is the time at which the download will finish, calculated from the size of the file (known from the trace data and stored as part of the event in the event queue) and a transfer rate that the user specifies when running the simulation. If the file is being downloaded from a peer client, the timestamp of the END_DOWNLOAD event is the time at which the peer client will itself finish downloading. If that interval is not long enough for the full file to be downloaded, a new START_DOWNLOAD event is added to the queue, with timestamp equal to the END_DOWNLOAD event just added and a file size representing the remainder of the file that has to be downloaded. When this new START_DOWNLOAD event is processed, either a new peer client is found from which the rest of the file can be downloaded or, if no such client exists, the remaining bytes are obtained from the origin server. In this sense, the Simulator does "aggressive sharing": it always tries to use a peer client for downloading the file, and only obtains the file from the origin server as a last resort.
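The START_DOWNLOAD handling under aggressive sharing can be sketched as follows. The class, map, and method names are illustrative assumptions, not the actual Simulator.java code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class StartDownloadSketch {
    // file id -> clients currently downloading that file
    static final Map<String, List<String>> active = new HashMap<>();
    static final Random rng = new Random();

    // Returns the source the new client downloads from, then records the
    // client as an active downloader of the file.
    static String pickSource(String fileId, String clientId) {
        List<String> peers = active.computeIfAbsent(fileId, k -> new ArrayList<>());
        String source = peers.isEmpty()
            ? "origin"                               // no active peer: last resort
            : peers.get(rng.nextInt(peers.size()));  // random active peer
        peers.add(clientId);
        return source;
    }

    public static void main(String[] args) {
        System.out.println(pickSource("f1", "c1"));  // prints "origin"
        System.out.println(pickSource("f1", "c2"));  // prints "c1"
    }
}
```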

The Simulator maintains a running total of the number of bytes transferred between all nodes (including the origin server), and a running total of the bytes transferred between peer nodes (i.e., the number of bytes shared). Periodically it prints out the fraction of shared bytes over total bytes processed. The Simulator also maintains, for each client, the maximum number of peer clients it uploads to concurrently. At the end of the simulation, the cumulative fraction of peers having to concurrently upload to x other peers is calculated, where x ranges from 0 to the maximum number of concurrent uploads that any client has to perform.
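The sharing statistic amounts to a ratio of two running counters, which could be kept as in this small sketch (names are my own, not from Simulator.java):

```java
// Tracks the fraction of all transferred bytes that came from peers
// rather than from the origin server.
public class SharingStats {
    long totalBytes = 0;  // all bytes transferred, including from origin
    long peerBytes = 0;   // bytes transferred peer-to-peer only

    void record(long bytes, boolean fromPeer) {
        totalBytes += bytes;
        if (fromPeer) peerBytes += bytes;
    }

    double fractionShared() {
        return totalBytes == 0 ? 0.0 : (double) peerBytes / totalBytes;
    }
}
```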

The simulation was run using transfer rates of 1 KB/sec, 5 KB/sec, 10 KB/sec, 50 KB/sec, 100 KB/sec, and 500 KB/sec.

Here is a copy of the pruned trace file used:  trace.zip

Here is a copy of the source code for the Simulator: Simulator.java

See the final presentation and write-up for the resulting data.

5 Minute Project Presentation (November 14, 2006)

The 5 minute project presentation is available in PDF format here.

Midterm Progress Report (November 10, 2006)

The midterm progress report is available in PDF format here.

Source and Binaries for Download (November 10, 2006)

The CBDS implementation as described in the Midterm Progress Report (above) is contained in cbds.jar. The jar file comes with the source code. Please note that the source code is presently in a slightly "hackish" state, but this will be cleaned up prior to the final submission at the end of the term.

In order to use CBDS, you need a copy of Apache XML-RPC version 3. This can be obtained from the Apache XML-RPC homepage.

CBDS is a Java proxy that currently runs as a separate process. It will be packaged in an XPI (Firefox extension) file, as described in the midterm report, for the final submission; I haven't packaged it as an extension yet since it is easier to debug when run as a separate process. To run the proxy, edit the following shell script to suit your needs: cbds.sh. Point the XMLRPC variable to where your copy of xmlrpc-3.0/lib is located, and change the system properties if needed. The system property CBProxyPort is the port number on which the proxy listens for connections from the local browser, and CBServerPort is the port on which the proxy listens for connections from remote browsers.
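As a sketch, an edited cbds.sh might look like the following. Only the XMLRPC variable, cbds.jar, and the two property names come from the description above; the jar names under xmlrpc-3.0/lib, the CBServerPort value, and the main class name are assumptions.

```shell
#!/bin/sh
# Hypothetical example of an edited cbds.sh. The main class name (CBDSMain)
# and the CBServerPort value are placeholders, not taken from the release.
XMLRPC=/path/to/xmlrpc-3.0/lib

java -cp "cbds.jar:$XMLRPC/xmlrpc-common-3.0.jar:$XMLRPC/xmlrpc-client-3.0.jar:$XMLRPC/xmlrpc-server-3.0.jar" \
     -DCBProxyPort=10006 \
     -DCBServerPort=10007 \
     CBDSMain
```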

After you've started the CBDS Java proxy, launch Firefox and go to Edit->Preferences and click the Connections button. Change your settings to Manual HTTP Proxy with hostname "localhost" and port number "10006" (or whatever you changed CBProxyPort to).

CBDS was developed using: Firefox 1.5.0.7 and Sun Java 1.5.0_09.

Project Proposal (October 10, 2006)

What is the problem I am solving?

    When a user is downloading a large amount of content from a server, he would like to increase the speed of his download. At the same time, servers that provide content that is in high demand would like to reduce the load that they experience. The goal is to solve these two problems with one solution: cooperative browser caches. By having users share their caches with other users, they can download content from machines that are closer in proximity to them, potentially increasing the speed of their download, and at the same time reduce the load on the server by obtaining the same content from somewhere else.

Why is the problem interesting?

    It is becoming more and more difficult for servers that provide content that ranks as the most popular on the Internet to serve that content. Many servers can't keep up with the load, and need to use content distribution networks. Companies who want to serve this content are forced to invest more and more money to keep up with the popularity of their websites.
   
    Companies would be interested in a solution to this problem if it did not require any capital investment on their part. This is precisely one of the main selling features of cooperative browser caches: they do not require companies to buy more servers or to pay for participation in a content distribution network. The onus is entirely on the users to help share the content that the server provides.

Why is the problem hard?

    In order for a cooperative caching scheme to work, a user needs an incentive to participate in it. Without one, the user has no reason to donate his bandwidth. How is such an incentive provided? One possible suggestion is to have a server refuse to provide content to a client unless the client is willing to participate in the cooperative caching scheme. In the case of my implementation, this would require the server to be aware that the client has the Firefox extension installed.
    Even if there is incentive, there also needs to be ease of use: participating in the cooperative caching scheme must be a transparent experience. If the user has to install several different pieces of software, run processes separate from his browser, or spend time changing his configuration, he may not want to bother. In that case he will go elsewhere for the content he was seeking, or abandon the download altogether.
    If a user chooses to participate in a cooperative caching scheme, he will be essentially advertising to other users the content that he downloaded. This could be seen as an invasion of privacy; he may want to download the content, which requires him to participate in the cooperative caching scheme, but at the same time he may not want others to know that he has downloaded this content. How can a user's privacy be maintained while at the same time allowing him to participate in the scheme?

How am I planning to solve the problem? 

    My goal is to implement a Firefox extension that allows a user to share his cache. This will require intercepting incoming and outgoing HTTP from within the extension, and also serving content from the extension. Content will be "shared" only for the time during which the user is downloading content; the user's cache will not become open indefinitely. Once a user has finished downloading the content, any server sockets opened will be closed, and anyone downloading from the user will be required to finish his download elsewhere (possibly from the origin server).

    OpenDHT will be used to maintain a mapping between content (e.g., URIs) and nodes that are participating in the cooperative caching scheme. When a client makes a request for a web page, it will first look up the URI in OpenDHT to see whether any other nodes are currently serving the content. If there are, the request will be redirected to one of those nodes; otherwise, the request will go to the origin server. In either case, the client will add a mapping to OpenDHT between the URI it is requesting and itself, so that other clients know it is willing to share the data.
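The lookup-then-advertise flow can be sketched with the DHT modeled as a local map; a real client would instead issue get/put calls to OpenDHT over its RPC interface. All class and method names here are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DhtFlowSketch {
    // Stand-in for OpenDHT: URI -> nodes currently sharing that content.
    static final Map<String, List<String>> dht = new HashMap<>();

    // Look up the URI; download from a sharing node if one exists, else
    // from the origin server. Either way, advertise ourselves as a sharer.
    static String resolve(String uri, String self) {
        List<String> sharers = dht.computeIfAbsent(uri, k -> new ArrayList<>());
        // A real client might pick a sharer at random; take the first here
        // to keep the sketch deterministic.
        String target = sharers.isEmpty() ? "origin-server" : sharers.get(0);
        sharers.add(self);
        return target;
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com/big.iso", "node-a"));  // prints "origin-server"
        System.out.println(resolve("http://example.com/big.iso", "node-b"));  // prints "node-a"
    }
}
```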

What is the related work?

    Extensive work has been done previously on cooperative web proxy caching. However, Wolman et al. [1] demonstrated that cooperative proxy caching has only a minor benefit within limited population bounds. In [2], Iyer et al. examined Squirrel, a decentralized peer-to-peer web cache; this work is very similar to the work we are proposing here. There is also similar work being done in CoopNet [3], which examines the benefits that peer-to-peer communication can bring to the performance of client-server applications in a network.

References

[1] Alec Wolman, Geoffrey M. Voelker, Nitin Sharma, Neal Cardwell, Anna Karlin, and Henry M. Levy. "On the scale and performance of cooperative Web proxy caching", In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP '99), pages 16-31, Kiawah Island Resort, SC, USA, December, 1999.
[2] S. Iyer, A. Rowstron, and P. Druschel. "Squirrel: A decentralized peer-to-peer web cache", In Principles of Distributed Computing (PODC 2002), Monterey, CA, 2002.
[3] CoopNet: Cooperative Networking. http://research.microsoft.com/projects/coopnet/

(Link to Course Homepage: CSC2209)