5. Conclusions and Discussion
We characterized compound documents generated by the three most popular
applications of the Microsoft Office suite (Word, Excel, and PowerPoint),
collected from the Web and our local file system. Our study encompasses
over 12,500 documents, comprising over 4 GB of data, retrieved from 935
different Web sites.
Our main conclusions are:
1. Compound documents are in general much larger than current HTML
   documents. The tail of the size distribution follows the same
   power-law distribution previously observed with HTML documents.
2. For large documents, embedded objects and images comprise the majority
   of the document.
3. For small documents, the XML format produces much larger documents
   than OLE. For large documents, there is little difference.
4. Compression considerably reduces the size of documents
   in both XML and OLE archives.
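As a rough illustration of how the effect of compression in the last conclusion could be measured, the following minimal sketch computes the ratio of compressed to original size for a byte string. It assumes zlib (DEFLATE) as the compressor, which is not necessarily the one used in our experiments; the function name `compression_ratio` and the sample input are hypothetical and do not come from the study's data.

```python
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (lower means more savings).

    Assumption: zlib/DEFLATE stands in for whatever compressor is actually used.
    """
    if not data:
        # Nothing to compress; report a neutral ratio.
        return 1.0
    return len(zlib.compress(data, level)) / len(data)


# Hypothetical stand-in for a document body: highly repetitive markup.
xml_like = b"<w:p><w:r><w:t>Hello</w:t></w:r></w:p>" * 500

ratio = compression_ratio(xml_like)
print(f"compressed/original = {ratio:.3f}")
```

Applying the same function to the bytes of an OLE archive and to its XML counterpart would give a simple per-document comparison of the kind summarized above.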
Furthermore, our experience studying the MS-Office file formats resulted in the following insights:
1. The data suggest that the "save as" operation is largely
   misunderstood by users. The large savings that we show from garbage
   collection suggest that users do not understand the implications of
   fast-save mode (the default). Moreover, we believe that most
   users perceive the "save as" operation as just a way to create a copy
   of the document.
2. The lack of support for compression in the OLE Structured Storage
   Interface has forced designers to implement ad hoc solutions in order
   to achieve high performance. This experience suggests that a
   compression interface would be a desirable addition to the Structured
   Storage Interface.
3. OLE archive formats are likely to remain the preferred intermediate format
   for MS-Office documents, while the XML-based format will likely be the
   format of choice for Web publishing. The XML-based format has the
   advantage that it can potentially be interpreted by applications other
   than MS-Office (e.g., Web browsers). It is also amenable to
   widespread browser techniques that improve user-perceived latency,
   such as incremental rendering and on-demand fetching. On the flip side, the
   current implementation of MS-Office 2000 does not support
   incremental loading or writing of XML-based documents, leading to
   higher latencies for opening and storing XML-based documents than
   those experienced with similar OLE archive documents. Moreover, some of
   the MS-Office formats do not yet have XML equivalents.
de Lara, Wallach, and Zwaenepoel
November, 1999