5. Conclusions and Discussion

We characterized compound documents collected from the Web and our local file system, generated by the three most popular applications of the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents, comprising over 4 GB of data, retrieved from 935 different Web sites.

Our main conclusions are:

1. Compound documents are in general much larger than current HTML documents. The tail of the size distribution follows the same power-law distribution previously observed with HTML documents.
2. For large documents, embedded objects and images make up the majority of the document.
3. For small documents, the XML format produces much larger documents than OLE. For large documents, there is little difference.
4. Compression considerably reduces the size of documents in both XML and OLE archives.
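
As a rough illustration of how the compression conclusion can be checked on a single document, the sketch below computes the ratio of compressed to original size for a byte buffer. It assumes zlib (DEFLATE) as the compressor, which this section does not specify, and the function name `compression_ratio` and the sample markup are hypothetical stand-ins rather than part of the study.

```python
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (lower means more savings)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(zlib.compress(data, level)) / len(data)


# Hypothetical stand-in for a document's bytes: highly repetitive markup,
# which compresses far better than a real document typically would.
sample = b"<w:p><w:r><w:t>Hello</w:t></w:r></w:p>" * 500

ratio = compression_ratio(sample)
print(f"compressed to {ratio:.1%} of original size")
```

To apply this to a real file, read it in binary mode and pass the bytes to `compression_ratio`. The ratio depends heavily on content: text and markup shrink substantially, while embedded images that are already compressed shrink very little.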

Furthermore, our experience studying the MS-Office file formats resulted in the following insights:

1. The data suggest that the "save as" operation is largely misunderstood by users. The large savings that we show from garbage collection suggest that users do not understand the implications of fast-save mode (the default). Moreover, we believe that most users perceive the "save as" operation as just a way to create a copy of the document.
2. The lack of support for compression in the OLE Structured Storage Interface has forced designers to implement ad hoc solutions in order to achieve high performance. This experience suggests that a compression interface would be a desirable addition to the Structured Storage Interface.
3. OLE archive formats are likely to remain the preferred intermediate format for MS-Office documents, while the XML-based format will likely be the format of choice for Web publishing. The XML-based format has the advantage that it can potentially be interpreted by applications other than MS-Office (e.g., Web browsers). It is also amenable to widespread browser techniques that improve user-perceived latency, such as incremental rendering and on-demand fetching. On the other hand, MS-Office 2000 does not currently implement incremental loading or writing of XML-based documents, leading to higher latencies for opening and storing XML-based documents than those experienced with similar OLE archive documents. Moreover, some of the MS-Office formats do not yet have XML equivalents.

de Lara, Wallach, and Zwaenepoel
November, 1999