This section presents statistics we have measured for MS-Office documents and the components within them. We compare our results to previously published statistics for HTML documents.
Table 2 shows general statistics for Word, Excel, and PowerPoint documents1. The most striking aspects of the data are the large average size of the documents and the large standard deviations of our sample.
Figure 1 shows the size distribution of Word, Excel, and PowerPoint documents. The histogram plots documents with sizes up to 180 KB. We observe that the distributions have the same general shape: a cluster around a common small value with a fairly long tail.
Figure 2 characterizes the distributions' tails by
plotting document size frequencies for documents larger than 100 KB on
a log-log scale. The linear fit of the transformed data, with
R² = 0.8938, suggests that the tail of the size
distribution closely follows a power-law distribution, which
explains the large standard deviations in table 2.
The log-log scale histograms for the individual Word, PowerPoint, and
Excel documents are not shown here since they are all similar to the
cumulative distribution, with linear fits yielding
R² = 0.8612, R² = 0.8352, and R² = 0.8226, respectively.
These results are similar to the findings of Cunha
et al. [5], where the size of HTML-based Web documents was
found to follow a power-law distribution. However, while Cunha
et al. found that most HTML documents are quite small (usually
between 256 and 512 bytes), MS-Office compound documents tend to be
much larger. Common sizes of Word and Excel documents range from
12 KB to 24 KB, and common sizes of PowerPoint documents range from
48 KB to 80 KB. Moreover, the size of compound documents smaller than
40 KB does not follow a power-law distribution.
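The tail analysis described above can be illustrated with a minimal sketch: an ordinary least-squares fit of log frequency against log size, with R² as the goodness of fit. The function name `loglog_linear_fit` and the synthetic bin values are assumptions made for illustration; they are not the tooling or the data used in the study.

```python
import math

def loglog_linear_fit(sizes, frequencies):
    """Fit log10(frequency) = slope * log10(size) + intercept.

    Returns (slope, intercept, r_squared). Bins with a zero size or
    zero frequency are skipped because their logarithm is undefined.
    """
    points = [(math.log10(s), math.log10(f))
              for s, f in zip(sizes, frequencies) if s > 0 and f > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two non-empty bins")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if sxx == 0:
        raise ValueError("all sizes are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear regression, R^2 is the squared correlation.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical tail: frequency proportional to size^-2 above 100 KB.
sizes_kb = [100, 200, 400, 800, 1600]
frequencies = [1e6 / (s ** 2) for s in sizes_kb]
slope, intercept, r2 = loglog_linear_fit(sizes_kb, frequencies)
```

A straight line on the log-log plot (R² close to 1) with a negative slope is what a power-law tail looks like; real histogram data would be noisier than this synthetic example.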
Figure 3 shows the breakdown of document sizes for Word documents. For every size category it shows the contributions of text, formatting information, embedded objects, and images to the document's size. We measured similar breakdowns for PowerPoint and Excel documents; however, because of space concerns we do not include them in this paper. The PowerPoint documents showed a trend similar to that of Figure 3, while in Excel documents the text component accounts for over 95% of the document size in all size categories.
Figure 3 shows that small Word documents are dominated by text and formatting information. However, for larger Word documents, image and embedded component data become the prevalent contributors to document size. This data strongly suggests that efforts to improve access to compound documents should focus on the image and embedded component data.
One possible optimization would be to remove the embedded component
data from documents that are fetched exclusively for reading. As
described in section 2.2.2, this data is only necessary
when editing an embedded component. Users will still be able to
display the document using the cached image of the component. We
measured the savings of this scheme and found that it would lead to a
reduction in bandwidth requirements for Word and PowerPoint documents
of up to 35% and 21%, respectively. PowerPoint documents show less
potential benefit because PowerPoint compresses its component data
before storing it in the OLE archive, whereas Word uses no compression.
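As a rough illustration of how such a saving could be estimated for a single document, the sketch below computes the fraction of bytes that would not be transferred if the embedded component data were stripped. The function name, the category keys, and the byte counts are all hypothetical; they are not measurements from our sample.

```python
def read_only_saving(breakdown):
    """Fraction of a document's bytes saved by omitting embedded data.

    `breakdown` maps a content category to its size in bytes. Only the
    "embedded" category is dropped; the cached image of each component
    is assumed to be counted under "images" and is kept.
    """
    total = sum(breakdown.values())
    if total == 0:
        return 0.0
    return breakdown.get("embedded", 0) / total

# Hypothetical size breakdown of one Word document, in bytes.
example = {
    "text": 30_000,
    "formatting": 10_000,
    "images": 40_000,
    "embedded": 20_000,
}
saving = read_only_saving(example)  # 20,000 of 100,000 bytes
```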
In this section we compare documents stored in OLE archives with their XML representations. The results of our comparison are shown in table 3 and figures 4, 5, and 6. The data reveals that the XML representation is significantly larger, requiring up to 5 times more space. XML efficiency is particularly low for small files, which according to our data are the most prevalent. However, XML efficiency improves dramatically as documents get larger.
To understand this, we must understand what happens when a document is
converted from an OLE archive to XML. Text and formatting represented
in XML take more space than in Microsoft's internal representation. This
explains the inefficiency of XML for small files. However, the XML
conversion compresses images and embedded component data. PowerPoint
already compresses its embedded component data, but Word and Excel do not.
Because larger documents tend to consist mostly of images and components
(see figure 3), the XML representation becomes more efficient as
documents grow, and it becomes more efficient than the OLE archive for
documents larger than 1 MB.
In this section we explore the benefits of compression on the OLE archives and the XML formats. For the OLE archives, we compressed the document by applying gzip to the OLE archive. For the XML format, which uses several files, we compressed each file separately. This strategy emulates the potential benefits of a network infrastructure with built-in compression.
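The two compression strategies can be sketched as follows, using Python's standard `gzip` module as a stand-in for the gzip tool. The helper names and the synthetic byte strings are assumptions for illustration only, and the savings they produce say nothing about real documents.

```python
import gzip

def saving(original_size, compressed_size):
    """Fraction of bytes saved by compression."""
    if original_size <= 0:
        raise ValueError("original size must be positive")
    return 1.0 - compressed_size / original_size

def gzip_whole(archive):
    """Compressed size of a single OLE archive given as bytes."""
    return len(gzip.compress(archive))

def gzip_each(files):
    """Total compressed size when every XML file is gzipped on its own.

    `files` maps a file name to that file's content as bytes.
    """
    return sum(len(gzip.compress(content)) for content in files.values())

# Synthetic stand-ins for one document in each representation.
ole_archive = b"PowerPoint slide text " * 500
xml_files = {
    "document.xml": b"<p>slide text</p>" * 500,
    "styles.xml": b"<style name='body'/>" * 100,
}

ole_saving = saving(len(ole_archive), gzip_whole(ole_archive))
xml_original = sum(len(content) for content in xml_files.values())
xml_saving = saving(xml_original, gzip_each(xml_files))
```

Compressing each XML file separately pays the gzip header and dictionary warm-up cost once per file, which is the behaviour one would expect from a transport that compresses each transferred file independently.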
The results of these experiments are shown in table 3 and figures 4, 5, and 6. Compression dramatically reduces the size of both OLE archives and XML files, achieving savings as high as 77% for OLE and 90% for XML. Moreover, the difference in size between the compressed OLE and compressed XML representations is small enough to be insignificant. This implies that neither representation has an inherent bandwidth advantage when used across a network.
For OLE archives, MS-Office optimizes "save" operations by appending modifications to the end of the file rather than rewriting the whole file every time. While this optimization allows for much faster document saves, it can lead to a significant increase in file size. If the user deletes or rewrites a substantial portion of a document and saves it, the original data, now garbage, is still retained, consuming disk space with no benefit to the user.
When a user asks MS-Office to "save as," a new document is written from scratch, without any garbage that may have been in the original document. We measured the changes in file size for OLE archives by using the "save as" operation. In this experiment we only considered documents that were already in Office 2000 file formats. Other documents are not included because the "save as" operation not only results in garbage collection but also reformats the documents to the Office 2000 formats, potentially increasing or decreasing file size.
Figure 7 shows the results of this experiment. Most
documents get some benefit from garbage collection. Notably, 24%
of Word documents and 35% of PowerPoint documents achieve savings
greater than 16%.
In this section we first explore the effects of components on document size. We then present detailed statistics for the three types of components found in MS-Office documents: images, embedded components, and virtual components.
We compared the sizes of MS-Office documents with and without embedded
components. Unsurprisingly, documents with embedded components are
significantly larger. For example, the average size of a Word document
with components is 557.28 KB, relative to an average of 112.32 KB for
documents without components. PowerPoint and Excel documents show
similar trends: PowerPoint documents average 1334.43 KB with
components and 493.58 KB without, and Excel documents average
509.71 KB with components and 109.18 KB without. Further details are
presented in figure 8. Despite differences in
average file size, it is interesting to note that documents with and
without components still follow the same size-distribution shape.
Images are the most common type of non-text data found in MS-Office documents. As table 4 shows, 34.62% of Word and 77.01% of PowerPoint documents on the Web have at least one image. We do not present results for Excel documents, as very few of them have any images at all. These results are comparable to the findings of Bray [2], where images were found to be the most common non-text elements in HTML documents, and 50% of HTML documents had at least one image.
Figures 9 and 10 show the average number of images and the average size of images for PowerPoint documents. Both plots show similar trends, with increases in the number and size of images as documents get bigger. These results are consistent with the findings of section 4.2, where images become the dominant contributor to document size as document size increases. The results for Word are similar and are omitted for compactness.
We compared the average size of images in MS-Office documents to the findings of previous Web studies [17,1]. In general, these studies report average image sizes between 5 KB and 22 KB. In comparison, MS-Office documents, especially PowerPoint documents, tend to have larger images.
We measured the reuse of images in our PowerPoint documents by
calculating the Adler-32 checksum [6] of each image's data and
counting the number of documents that have images with the same
signature. We found that of the 16189 images embedded in PowerPoint
documents, only 14016 are distinct, and 1241 of these distinct images
(8.85%) appeared in more than one document. We calculated the potential
bandwidth saving of a perfect cache for reading all PowerPoint images
and found it to be 15.50%, which correlates well with the proportion of
transfers that are saved by the cache, 13.42%. Finally,
figure 11 plots the proportion of images that are
reused in a given number of documents, for those images that appear in
at least two documents.
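The reuse measurement can be sketched as below, using `zlib.adler32` from the Python standard library. The function name and the toy documents are assumptions for illustration, and the sketch treats two images with equal checksums as identical, ignoring possible Adler-32 collisions. With the counts reported above, the proportion of transfers saved by a perfect cache is (16189 − 14016) / 16189 ≈ 13.42%.

```python
import zlib
from collections import defaultdict

def image_reuse_stats(documents):
    """Summarize image reuse across documents.

    `documents` maps a document id to a list of image payloads (bytes).
    Returns (total_images, distinct_images, shared_images,
    transfers_saved), where shared_images counts the distinct images
    found in more than one document and transfers_saved is the fraction
    of image transfers a perfect cache would avoid.
    """
    docs_by_checksum = defaultdict(set)
    total_images = 0
    for doc_id, images in documents.items():
        for data in images:
            total_images += 1
            docs_by_checksum[zlib.adler32(data)].add(doc_id)
    distinct_images = len(docs_by_checksum)
    shared_images = sum(1 for docs in docs_by_checksum.values()
                        if len(docs) > 1)
    if total_images == 0:
        transfers_saved = 0.0
    else:
        transfers_saved = (total_images - distinct_images) / total_images
    return total_images, distinct_images, shared_images, transfers_saved

# Toy input: one logo shared by three hypothetical presentations.
documents = {
    "a.ppt": [b"logo", b"chart-1"],
    "b.ppt": [b"logo", b"chart-2"],
    "c.ppt": [b"logo"],
}
total, distinct, shared, saved = image_reuse_stats(documents)

# Same ratio computed from the counts reported in the text.
reported_saved = (16189 - 14016) / 16189
```

The bandwidth saving differs from the transfer saving because it weights each avoided transfer by the size of the image, which this sketch does not track.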
The data in table 5 shows that MS-Office documents are rich in component data, with 18.19% of Word documents and 46.38% of PowerPoint documents having at least one embedded component. Furthermore, the data shows a high diversity of component types, with Word documents having the highest diversity.
Table 6 shows the popularity and average size of component types for Word, PowerPoint, and Excel documents. For all three applications, image components are either the first or second most popular type. Additionally, the average size of image components is among the largest of all types. This evidence further suggests that efforts toward reducing file size should focus on image components.
Figures 12 and 13 show
the average number of embedded components and the average size of
components for Word documents. As with images, both plots show an
increase in the number and size of components as documents get
bigger. These results are consistent with the findings of
section 4.2, where the contribution of
embedded components to the document size grows significantly as
document size increases. The results for PowerPoint and Excel
are similar and are omitted for compactness.
Table 7 shows the average number of pages, slides, and sheets found in Word, PowerPoint and Excel documents.
Figure 14 shows that the average number of pages
increases initially with the size of the document and then levels
off. This data suggests that most documents have a similar length
and that differences in size are due mainly to the level of
sophistication of the document (i.e., whether it has more pictures and
embedded components). This is consistent with the findings of
section 4.2, where the size of the text element
of Word and PowerPoint documents remained almost constant over large
variations in document size.