We collected Word, Excel, and PowerPoint documents from the Web. First, we used a commercial search engine to obtain an initial set of URLs, searching for pages that link to files with the suffixes of interest (ppt, xls, and doc). Then, we used GNU Wget [15] to recursively retrieve documents starting from these initial search results.
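As an illustration of the retrieval step, the following minimal sketch builds a GNU Wget command line for a recursive crawl restricted to the three suffixes. The helper name `build_wget_command`, the seed-file and output-directory parameters, and the default recursion depth are assumptions made for the example; the exact options used in the study are not stated here.

```python
from typing import List


def build_wget_command(url_file: str, out_dir: str, depth: int = 2) -> List[str]:
    """Build a GNU Wget argument list for a recursive, suffix-filtered crawl.

    url_file, out_dir, and depth are hypothetical parameters for this sketch.
    """
    return [
        "wget",
        "--recursive",                    # follow links from the seed pages
        f"--level={depth}",               # assumed recursion depth limit
        "--accept=ppt,xls,doc",           # keep only files with these suffixes
        f"--directory-prefix={out_dir}",  # store downloads under out_dir
        f"--input-file={url_file}",       # read seed URLs, one per line
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where Wget is installed.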
All downloaded documents were in the binary OLE archive format. Because Microsoft's file formats vary from one version of its software to another, we first converted all our data to the Office 2000 formats. We removed documents that appeared to be corrupt or were not actually Office documents; the doc suffix, in particular, is used by many applications other than Microsoft Word. We also eliminated duplicates, which removed approximately 5% of our data set.
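The filtering and duplicate-elimination steps can be approximated as follows. This is a sketch under stated assumptions, not the procedure actually used: it treats a file as a candidate Office document if it begins with the 8-byte OLE compound-file signature, and it treats two files as duplicates only if their contents are byte-for-byte identical (compared by SHA-256 digest). The names `is_ole_file` and `filter_and_dedupe` are hypothetical.

```python
import hashlib
from typing import Iterable, List, Set, Tuple

# Signature found at the start of every OLE compound file.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def is_ole_file(data: bytes) -> bool:
    """Return True if the data begins with the OLE compound-file signature."""
    return data.startswith(OLE_MAGIC)


def filter_and_dedupe(docs: Iterable[Tuple[str, bytes]]) -> List[str]:
    """Return names of OLE files, keeping the first copy of identical contents.

    Assumption: duplicates are exact byte-level copies.
    """
    seen: Set[str] = set()
    kept: List[str] = []
    for name, data in docs:
        if not is_ole_file(data):
            continue  # e.g. a .doc file written by another application
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # identical content already kept under another name
        seen.add(digest)
        kept.append(name)
    return kept
```

The signature check is only a first pass: it does not confirm that a file is a valid Word, Excel, or PowerPoint document, so corrupt files would still have to be caught when the application fails to open them.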
We performed the conversion to Office 2000 formats, and obtained the XML-based representation of each document, using the MS-Office OLE Automation interfaces [13]. OLE Automation lets a simple Java application that we wrote control the Office applications remotely and perform the data conversions.
Table 1 summarizes the documents. For each application, it lists the Internet domains from which the documents originated, the number of documents, and the number of sites. The local domain corresponds to documents taken from our local NFS server rather than from the Web.