
1. Introduction

Productivity tools, often part of "office suites," are the most popular applications for creating documents. Their popularity derives, to some extent, from their ability to create compound documents that include data from more than one application.

Until recently, the documents produced by office suites have been unwieldy to publish on the Web. These documents can be quite large, sometimes hundreds of kilobytes, and Web browsers cannot display them until the entire document has been downloaded.

Recent improvements in office suites have made it easier to publish compound documents on the Web by exporting browser-compatible data types. These improvements, coupled with the popularity of productivity tools, will likely have a strong impact on the content of the Web and the infrastructure that supports this content. Previous studies have either overlooked compound documents altogether or have treated them as black boxes [2,18,16,1,17]. It is therefore important to study and characterize compound documents on the Web as they are now, in order to predict where the Web might be going and how Web infrastructure should support compound documents in the future.

For this paper, we studied compound documents generated by the three most popular applications of the Microsoft Office suite: Word, Excel, and PowerPoint. We chose to focus on Microsoft Office applications based on two factors. First, Microsoft Office is the most widely used productivity suite. Moreover, a significant number of Microsoft Office documents are available on the Web, enabling us to gather the data for our experiments. Second, Microsoft Office 2000 supports two native file formats: the proprietary OLE-based binary format and a new XML format. By using Office 2000 to convert old files to the new XML format, we can compare the tradeoffs of a proprietary binary file format against a modern, standards-based text format, both as intermediate formats suitable for document editing and as publishing formats suitable only for reading.

We downloaded over 12,500 documents, comprising over 4 GB of data, from 935 different sites. Our main results are:

1. Compound documents are in general much larger than current HTML documents. The tail of the size distribution follows the same power-law distribution observed with HTML documents.
2. For large documents, embedded objects and images comprise a large part of the document.
3. For small documents, the XML format produces much larger documents than OLE. For large documents, there is little difference.
4. Compression considerably reduces the size of documents in both formats.

The rest of this paper is organized as follows. Section 2 provides some background on compound documents and their enabling technology, and discusses relevant characteristics of the three Microsoft Office applications that we use in this study. Section 3 describes the documents we used in our experiments. Section 4 presents our experimental results. Finally, Section 5 discusses our conclusions.


de Lara, Wallach, and Zwaenepoel
November, 1999