Next: 4. Experiments Up: A Characterization of Compound Previous: 2. Background

   
3. Data set

We collected Word, Excel, and PowerPoint documents from the Web. First, we used a commercial search engine to obtain an initial set of URLs, searching for pages that link to files with the suffixes of interest (doc, xls, and ppt). Then, we used GNU Wget [15] to recursively retrieve documents, starting from these initial search results.
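The suffix-based selection described above can be illustrated with a minimal Python sketch. It is not the tooling used for the study (retrieval was done with GNU Wget); the function names and example URLs are hypothetical, and only the three suffixes come from the text.

```python
import posixpath
from urllib.parse import urlparse

# The three suffixes named in the text; everything else here is illustrative.
OFFICE_SUFFIXES = {"doc", "xls", "ppt"}


def office_suffix(url):
    """Return 'doc', 'xls', or 'ppt' if the URL path ends in one, else None."""
    path = urlparse(url).path  # drops any query string or fragment
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    return ext if ext in OFFICE_SUFFIXES else None


def filter_office_links(urls):
    """Keep only the links whose path carries one of the target suffixes."""
    return [u for u in urls if office_suffix(u) is not None]
```

Matching on the suffix alone only selects candidates; whether a file really is an Office document still has to be checked after download.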

All downloaded documents were in the binary OLE archive format. Because Microsoft's file formats vary from one version of its software to another, we first converted all our data to the Office 2000 formats. We removed documents that appeared to be corrupt or were not actually Office documents; the doc suffix, in particular, is used by many applications other than Microsoft Word. We also eliminated duplicates, which removed approximately 5% of our data set.
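The text does not say how non-Office files or duplicates were detected. As a hedged sketch, the Python below assumes two simple checks: a file is kept only if it begins with the 8-byte signature that opens every OLE compound file, and two files count as duplicates only if they are byte-identical (compared by SHA-256 digest). The function names are hypothetical, and the signature test alone would not catch every corrupt document.

```python
import hashlib

# Signature at the start of every OLE compound file, the binary container
# used by the Word, Excel, and PowerPoint formats discussed here.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(data):
    """True if the byte string starts with the OLE compound-file signature."""
    return data[:8] == OLE_MAGIC


def clean_corpus(files):
    """Filter (name, bytes) pairs; return the names kept, in input order.

    Assumption: duplicates are byte-identical copies, so a content digest
    is enough to recognize them.
    """
    seen = set()
    kept = []
    for name, data in files:
        if not looks_like_ole(data):
            continue  # e.g. a text file that merely has a .doc suffix
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # exact copy of a file already kept
        seen.add(digest)
        kept.append(name)
    return kept
```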

We performed the conversion to the Office 2000 formats, and obtained the XML-based representation of each document, through the MS-Office OLE Automation interfaces [13]. OLE Automation allowed a simple Java application that we wrote to control the Office applications remotely and carry out the data conversions.

Table 1 summarizes the documents. For each application, it lists the Internet domains from which the documents originated, along with the number of documents and the number of sites in each domain. The local domain corresponds to documents taken from our local NFS server rather than from the Web.

 
Table 1: Data set. This table presents the domains and sites from which our test documents originated. These numbers reflect the documents that remain after duplicates and corrupted files were removed.
Application  Domain    Documents  Sites
Word         com             412     28
             edu             813     73
             gov            1376     50
             org             362     58
             other           362     27
             local          3007      2
             Subtotal       6481    236
PowerPoint   com             515     95
             edu             669     73
             gov             333     45
             org             474    111
             other            51     10
             local           125      1
             Subtotal       2167    334
Excel        com             553    126
             edu            1520     76
             gov            1343     48
             org             448     93
             other            88     35
             local           104      1
             Subtotal       4056    378
Total                      12704    935
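A per-domain tally of the kind shown in Table 1 could be computed along the following lines. This Python sketch rests on assumptions the text does not confirm: a "site" is taken to be a distinct host name, the domain is read from the last label of the host name, and any source without a host is treated as a local file-system path that counts as a single local site. All names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Domains reported by name in Table 1; any other top-level domain is "other".
NAMED_DOMAINS = {"com", "edu", "gov", "org"}


def domain_of(source):
    """Map a document source (URL or local path) to a Table 1 domain label."""
    host = urlparse(source).hostname
    if host is None:
        return "local"  # assumption: no host means a local file-system path
    tld = host.rsplit(".", 1)[-1].lower()
    return tld if tld in NAMED_DOMAINS else "other"


def summarize(sources):
    """Return {domain: (document_count, site_count)} for the given sources."""
    docs = defaultdict(int)
    sites = defaultdict(set)
    for src in sources:
        dom = domain_of(src)
        docs[dom] += 1
        # Assumption: one site per distinct host name.
        sites[dom].add(urlparse(src).hostname or "local")
    return {dom: (docs[dom], len(sites[dom])) for dom in docs}
```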


de Lara, Wallach, and Zwaenepoel
November, 1999