3. Data set

We collected Word, Excel, and PowerPoint documents from the Web. First, we used a commercial search engine to obtain an initial set of URLs, searching for pages that link to files with the suffixes of interest (doc, xls, and ppt). Then, we used GNU Wget [15] to recursively retrieve documents, starting from these initial search results.
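The retrieval step can be sketched as follows. This is a minimal illustration rather than the configuration actually used: the text names only GNU Wget and the three suffixes, so the function names, the recursion depth, the seed file, and the output directory are all hypothetical. The sketch only builds the command line and does not run it.

```python
from typing import List
from urllib.parse import urlparse

# The three suffixes named in the text; everything else is an assumption.
OFFICE_SUFFIXES = ("doc", "xls", "ppt")


def has_office_suffix(url: str) -> bool:
    """Return True if the URL path ends in one of the suffixes of interest."""
    path = urlparse(url).path.lower()
    return path.endswith(tuple("." + s for s in OFFICE_SUFFIXES))


def build_wget_command(seed_file: str, out_dir: str, depth: int = 2) -> List[str]:
    """Build (but do not execute) a recursive GNU Wget invocation.

    seed_file, out_dir, and depth are hypothetical parameters; the paper
    does not state which options were passed to Wget.
    """
    return [
        "wget",
        "--recursive",                            # follow links from the seed pages
        f"--level={depth}",                       # assumed recursion depth
        "--accept=" + ",".join(OFFICE_SUFFIXES),  # keep only doc, xls, ppt files
        "--no-clobber",                           # do not re-download existing files
        f"--directory-prefix={out_dir}",          # where retrieved files are stored
        f"--input-file={seed_file}",              # initial URLs from the search engine
    ]
```

For example, `build_wget_command("seeds.txt", "corpus")` yields an argument list that could be passed to `subprocess.run`, and `has_office_suffix` can be used to pre-filter search results before they are written to the seed file.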

All downloaded documents were in the binary OLE archive format. Because Microsoft's file formats vary from one version of its software to another, we first converted all our data to the Office 2000 formats. We removed documents that appeared to be corrupt or were not actually Office documents; the doc suffix, in particular, tends to be used by many applications other than Microsoft Word. We also eliminated duplicates, which removed approximately 5% of our data set.
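One plausible way to carry out this filtering is sketched below. The text does not say how non-Office files or duplicates were detected, so both techniques here are assumptions: the check for the 8-byte signature that begins every OLE compound file only rejects files that are not OLE archives at all (it does not detect corruption inside a valid archive), and the SHA-256 comparison only catches byte-identical duplicates. The function names are hypothetical.

```python
import hashlib
from typing import Dict, List

# Signature found at the start of every OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(data: bytes) -> bool:
    """Return True if the content starts with the OLE compound file signature."""
    return data.startswith(OLE_MAGIC)


def filter_and_deduplicate(files: Dict[str, bytes]) -> List[str]:
    """Return the names of files that are OLE archives and not exact duplicates.

    `files` maps a file name to its raw content. Names are visited in sorted
    order, so the first name among identical files is the one that is kept.
    """
    seen_digests = set()
    kept = []
    for name in sorted(files):
        data = files[name]
        if not looks_like_ole(data):
            continue  # e.g. a plain-text file that merely carries a .doc suffix
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_digests:
            continue  # byte-identical to a file that was already kept
        seen_digests.add(digest)
        kept.append(name)
    return kept
```

Hashing file contents rather than comparing names or URLs means that the same document mirrored on several sites is counted once.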

We performed this conversion, and obtained the XML-based representation of each document, using the MS-Office OLE Automation interfaces [13]. OLE Automation allowed a simple Java application that we wrote to remotely control the Office applications and perform the data conversions.

Table 1 summarizes the documents. For each application, it lists the Internet domains from which the documents originated, together with the number of documents and the number of sites in each domain. The local domain corresponds to documents taken from our local NFS server rather than from the Web.

Table 1: Data set. This table presents the domains and sites from which our test documents originated. These numbers reflect the documents that remain after duplicates and corrupted files were removed.

Application	Domain	Documents	Sites
Word	com	412	28
	edu	813	73
	gov	1376	50
	org	362	58
	other	362	27
	local	3007	2
	Subtotal	6481	236
PowerPoint	com	515	95
	edu	669	73
	gov	333	45
	org	474	111
	other	51	10
	local	125	1
	Subtotal	2167	334
Excel	com	553	126
	edu	1520	76
	gov	1343	48
	org	448	93
	other	88	35
	local	104	1
	Subtotal	4056	378
Total		12704	935
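
As a quick consistency check on Table 1, the following sketch recomputes the total number of documents from the three per-application subtotals and derives each application's share of the data set. The figures are copied from the table; the function and variable names are illustrative only.

```python
from typing import Dict

# Document subtotals and total as reported in Table 1.
SUBTOTALS = {"Word": 6481, "PowerPoint": 2167, "Excel": 4056}
REPORTED_TOTAL = 12704


def document_shares(subtotals: Dict[str, int]) -> Dict[str, float]:
    """Return each application's fraction of all documents."""
    total = sum(subtotals.values())
    return {app: count / total for app, count in subtotals.items()}


if __name__ == "__main__":
    # The subtotals should add up to the reported total.
    assert sum(SUBTOTALS.values()) == REPORTED_TOTAL
    for app, share in document_shares(SUBTOTALS).items():
        print(f"{app}: {share:.1%}")
```

Word documents make up roughly half of the data set, Excel about a third, and PowerPoint the remainder.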