A Characterization of Compound Documents on the Web

Eyal de Lara$^{\dagger}$, Dan S. Wallach$^{\ddagger}$, and Willy Zwaenepoel$^{\ddagger}$
$^{\dagger}$ Department of Electrical and Computer Engineering
$^{\ddagger}$ Department of Computer Science
Rice University

Recent developments in office productivity suites make it easier for users to publish rich compound documents on the Web. Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web's content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service.

To guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents created with the three most popular applications in the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935 different Web sites. Our main conclusions are:

1. Compound documents are in general much larger than current HTML documents.
2. For large documents, embedded objects and images make up a large part of the documents' size.
3. For small documents, XML format produces much larger documents than OLE. For large documents, there is little difference.
4. Compression considerably reduces the size of documents in both formats.
