Next: 5. Conclusions and Discussion Up: A Characterization of Compound Previous: 3. Data set


   
4. Experiments

This section presents statistics we have measured for MS-Office documents and the components within them. We compare our results to previously published statistics for HTML documents.

   
4.1 Document Size

Table 2 shows general statistics for Word, Excel, and PowerPoint documents. The most striking aspects of the data are the large average size of documents and the large standard deviations of our sample.

Figure 1 shows the size distribution of Word, Excel, and PowerPoint documents. The histogram plots documents with sizes up to 180 KB. We observe that the distributions have the same general shape: a cluster around a common small value with a fairly long tail.

Figure 2 characterizes the distributions' tails by plotting document size frequencies for documents larger than 100 KB on a log-log scale. The linear fit of the transformed data ($y \sim x^{-1.7124}$) with $R^2 = 0.8938$ suggests that the tail of the size distribution closely follows a power-law distribution, which explains the large standard deviations in table 2. The log-log scale histograms for the individual Word, PowerPoint, and Excel documents are not shown here since they are all similar to the cumulative distribution, with linear fits of $y \sim x^{-1.5254}$, $y \sim x^{-1.332}$, and $y \sim x^{-1.7485}$, and $R^2 = 0.8612$, $R^2 = 0.8352$, and $R^2 = 0.8226$, respectively.

These results are similar to the findings of Cunha et al. [5], where the size of HTML-based Web documents was found to follow the power-law distribution. However, while Cunha et al. found that most HTML documents are quite small (usually between 256 and 512 bytes), MS-Office compound documents tend to be much larger. Common sizes of Word and Excel documents range from 12 KB to 24 KB, and common sizes of PowerPoint documents range from 48 KB to 80 KB. Moreover, the size of compound documents smaller than 40 KB does not follow the power-law distribution.

  
Figure 1: Size distribution of Word, PowerPoint, and Excel documents. Shown are documents with sizes up to 180 KB.


  
Figure 2: Size distribution of larger MS-Office documents on a log-log scale. Document size frequencies are measured with 16384 byte bins.


 
Table 2: Document size statistics.
                 Application
Statistic        Word      PowerPoint   Excel
average (KB)     196.24    891.48       115.02
stdev (KB)       528.44    2145.35      438.70
median (KB)      47.25     182.25       28.50

   
4.2 Size breakdown

Figure 3 shows the breakdown of document sizes for Word documents. For every size category it shows the contributions of text, formatting information, embedded objects, and images to the document's size. We measured similar breakdowns for PowerPoint and Excel documents; however, because of space concerns we do not include them in this paper. The PowerPoint documents showed a similar trend to that of Figure 3, while in Excel documents the text component accounts for over 95% of the document size in all the size categories.

Figure 3 shows that small Word documents are dominated by text and formatting information. However, for larger Word documents, image and embedded component data become the prevalent contributors to document size. This strongly suggests that efforts to improve access to compound documents should focus on the image and the embedded component data.

One possible optimization would be to remove the embedded component data from documents that are fetched exclusively for reading. As described in section 2.2.2, this data is only necessary when editing an embedded component. Users will still be able to display the document using the cached image of the component. We measured the savings of this scheme and found that it would lead to a reduction in bandwidth requirements for Word and PowerPoint documents of up to 35% and 21%, respectively. PowerPoint documents show less potential benefit because PowerPoint compresses its component data before storing it in the OLE archive, whereas Word uses no compression.

  
Figure 3: Size breakdown of Word documents. The plot shows that as documents get bigger, images and embedded component data account for most of the document's size.

4.3 OLE archives vs. XML

In this section we compare documents stored in OLE archives with their XML representations. The results of our comparison are shown in table 3 and figures 4, 5, and 6. The data reveal that the XML representation is significantly larger, requiring up to 5 times more space. XML efficiency is particularly low for small files, which according to our data are the most prevalent. However, XML efficiency improves dramatically as documents get larger.

To understand this, we must understand what happens when a document is converted from an OLE archive to XML. Text and formatting represented in XML take more space than in Microsoft's internal representation. This explains the inefficiency of XML for small files. However, the XML conversion compresses images and embedded component data. PowerPoint already compresses its embedded component data, but Word and Excel do not. Because larger documents tend to consist mostly of images and components (see figure 3), the XML representation becomes more efficient as documents grow, and it becomes more efficient than the OLE archive for documents larger than 1 MB.

 
Table 3: Size statistics for documents in raw OLE, OLE compressed with gzip, raw XML, and XML compressed with gzip. The statistics for OLE differ from those presented in table 2 due to the conversion to MS-Office 2000 formats.
                           Application
Format     Statistic       Word      PowerPoint   Excel
OLE        average (KB)    209.19    579.53       110.23
           stdev (KB)      534.59    1671.36      401.83
           median (KB)     55.50     120.25       37.5
gzip OLE   average (KB)    61.43     481.18       25.67
           stdev (KB)      248.89    1597.20      97.88
           median (KB)     12.62     51.62        7.71
XML        average (KB)    226.43    795.17       336.90
           stdev (KB)      583.79    1851.92      1562.04
           median (KB)     70.45     299.15       78.04
gzip XML   average (KB)    74.14     549.03       28.37
           stdev (KB)      297.21    1713.56      92.02
           median (KB)     13.49     106.00       9.37


  
Figure 4: Size distribution of Word documents in OLE archives and XML formats. The plot shows the average sizes of documents in various size categories stored in the OLE archives and the newer XML file format, both with and without compression. All sizes are normalized by the size of the documents stored in the uncompressed OLE archives.


  
Figure 5: Size distribution of PowerPoint documents in OLE archives and XML formats. The plot shows the average sizes of documents in various size categories stored in the OLE archives and the newer XML file format, both with and without compression. All sizes are normalized by the size of the documents stored in the uncompressed OLE archives.


  
Figure 6: Size distribution of Excel documents in OLE archives and XML formats. The plot shows the average sizes of documents in various size categories stored in the OLE archives and the newer XML file format, both with and without compression. All sizes are normalized by the size of the documents stored in the uncompressed OLE archives.

4.4 Compression

In this section we explore the benefits of compression on the OLE archives and the XML formats. For the OLE archives, we compressed the document by applying gzip to the OLE archive. For the XML format, which uses several files, we compressed each file separately. This strategy emulates the potential benefits of a network infrastructure with built-in compression.

The results of these experiments are shown in table 3 and figures 4, 5, and 6. Compression has a dramatic effect on reducing the size of both OLE archives and XML files, achieving savings as high as 77% for OLE and 90% for XML. Moreover, the difference in size between compressed OLE and compressed XML representations is small enough to be insignificant. This implies that neither representation has an inherent bandwidth advantage when used across a network.
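As a rough illustration of how such savings can be computed, the sketch below compresses a byte string with Python's standard `gzip` module and reports the fraction of bytes saved. It also handles the multi-file case by compressing each part separately, as described for the XML format. The function names are hypothetical, and the compression level (the module default) is an assumption, since the text does not state which gzip settings were used.

```python
import gzip


def gzip_saving(data: bytes) -> float:
    """Fraction of bytes saved by gzip; 0.77 means the output is 77% smaller."""
    if not data:
        return 0.0
    return 1.0 - len(gzip.compress(data)) / len(data)


def total_saving(parts) -> float:
    """Saving when each part (for example, each XML file) is gzipped separately."""
    raw = sum(len(p) for p in parts)
    packed = sum(len(gzip.compress(p)) for p in parts)
    return 1.0 - packed / raw if raw else 0.0
```

For a single OLE archive read into memory as `data`, `gzip_saving(data)` gives the per-document saving. For an XML document split across several files, `total_saving` takes the list of file contents. Very small or already compressed inputs can yield a negative saving because of gzip's fixed header overhead.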

4.5 Garbage collection

For OLE archives, MS-Office optimizes ``save'' operations by appending modifications to the end of the file rather than rewriting the whole file every time. While this optimization allows for much faster document saves, it can lead to a significant increase in file size. If the user deletes or rewrites a substantial portion of a document and saves it, the original data, now garbage, will still be retained, consuming disk space with no benefit to the user.

When a user asks MS-Office to ``save as,'' a new document is written from scratch, without any garbage that may have been in the original document. We measured the changes in file size for OLE archives by using the ``save as'' operation. In this experiment we only considered documents that were already in Office 2000 file formats. Other documents are not included because the ``save as'' operation not only results in garbage collection but also reformats the documents to the Office 2000 formats, potentially increasing or decreasing file size.

Figure 7 shows the results of this experiment. Most documents get some benefit from garbage collection. Impressively, 24% of Word documents and 35% of PowerPoint documents achieve savings greater than 16%.
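The quantities in this experiment reduce to simple arithmetic, sketched below under the assumption that the saving is measured relative to the original file size. The helper names `gc_saving` and `share_above` are hypothetical and are not taken from the authors' tooling.

```python
def gc_saving(original_size: int, resaved_size: int) -> float:
    """Percentage of the original file size removed by a fresh ``save as''."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * (original_size - resaved_size) / original_size


def share_above(savings, threshold=16.0) -> float:
    """Fraction of documents whose saving (in percent) exceeds the threshold."""
    savings = list(savings)
    if not savings:
        return 0.0
    return sum(1 for s in savings if s > threshold) / len(savings)
```

Applying `gc_saving` to each (original, re-saved) size pair and passing the results to `share_above` gives the kind of figure quoted above, such as the share of documents saving more than 16%.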

  
Figure 7: Percentage saved by garbage collection of OLE archive documents.

   
4.6 Components

In this section we first explore the effects of components on document size. We then present detailed statistics for the three types of components found in MS-Office documents: images, embedded components, and virtual components.

4.6.1 Components and document size

We compared the sizes of MS-Office documents with and without embedded components. Unsurprisingly, documents with embedded components are significantly larger. For example, the average size of a Word document with components is 557.28 KB, compared with an average of 112.32 KB for documents without components. PowerPoint and Excel documents show similar trends: PowerPoint documents average 1334.43 KB with components and 493.58 KB without, and Excel documents average 509.71 KB with components and 109.18 KB without. Further details are presented in figure 8. Despite differences in average file size, it is interesting to note that documents with and without components still follow the same size-distribution shape.

  
Figure 8: Comparison of size distribution of Word documents with and without components.

4.6.2 Images

Images are the most common type of non-text data found in MS-Office documents. As table 4 shows, 34.62% of Word and 77.01% of PowerPoint documents on the Web have at least one image. We do not present results for Excel documents as very few of them have any images at all. These results are comparable to the findings of Bray [2], where images were found to be the most common non-text elements in HTML documents, and 50% of HTML documents had at least one image.

Figures 9 and 10 show the average number of images and the average size of images for PowerPoint documents. Both plots show similar trends, with increases in the number and size of images as documents get bigger. These results are consistent with the findings of section 4.2, where the size contribution of images to documents becomes the dominant factor as document size increases. The results for Word are similar, so they are omitted for compactness.

We compared the average size of images in MS-Office documents to the findings of previous Web studies [17,1]. In general these studies report the average size of images between 5 KB and 22 KB. In comparison, MS-Office documents, especially PowerPoint documents, tend to have larger images.

We measured the reuse of images in our PowerPoint documents by calculating the Adler-32 checksum [6] of each image's data and counting the number of documents that have images with the same signature. We found that of the 16189 images embedded in PowerPoint documents, only 14016 are distinct, while 1241 images, or 8.85%, appeared in more than one document. We calculated the potential bandwidth saving of a perfect cache for reading all PowerPoint images and found it to be 15.50%, which correlates well with the proportion of transfers saved by the cache (13.42%). Finally, figure 11 plots the proportion of images that are reused in a given number of documents, for those images that appear in at least two documents.
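The reuse measurement can be sketched as follows, using `zlib.adler32` from the Python standard library. This is an illustration under stated assumptions rather than the authors' implementation: the input is assumed to be a mapping from a document name to the list of its image byte strings, two images are treated as identical whenever their Adler-32 checksums match (collisions are possible and are ignored here), and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def image_reuse(documents):
    """Count total, distinct, and shared images.

    documents: mapping of document name -> list of image byte strings.
    Returns (total_images, distinct_images, shared_images), where
    shared_images counts checksums seen in more than one document.
    """
    docs_by_checksum = defaultdict(set)
    total = 0
    for name, images in documents.items():
        for data in images:
            total += 1
            docs_by_checksum[zlib.adler32(data)].add(name)
    shared = sum(1 for docs in docs_by_checksum.values() if len(docs) > 1)
    return total, len(docs_by_checksum), shared


def perfect_cache_savings(documents):
    """Fraction of image transfers and of image bytes a perfect cache avoids."""
    seen = set()
    transfers = hits = total_bytes = saved_bytes = 0
    for images in documents.values():
        for data in images:
            key = zlib.adler32(data)
            transfers += 1
            total_bytes += len(data)
            if key in seen:
                hits += 1
                saved_bytes += len(data)
            else:
                seen.add(key)
    if transfers == 0:
        return 0.0, 0.0
    return hits / transfers, (saved_bytes / total_bytes if total_bytes else 0.0)
```

With the counts reported above, a perfect cache would avoid (16189 - 14016) / 16189, or about 13.42%, of image transfers, which appears consistent with the stated proportion of transfers saved.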

 
Table 4: Image statistics for Word and PowerPoint documents. The table shows the percentage of documents that have at least one image, the average number of images in documents that have images, and the average image size.
                             Application
Statistic                    Word     PowerPoint
% of documents with images   34.62    77.01
average number of images     6.01     10.62
average image size (KB)      21.58    47.82


  
Figure 9: Average number of images in PowerPoint documents.


  
Figure 10: Average image size in PowerPoint documents.


  
Figure 11: Image reuse. The plot shows the proportion of images that appear in multiple PowerPoint documents.

4.6.3 Embedded components

The data in table 5 show that MS-Office documents are rich in component data, with 18.19% of Word documents and 46.38% of PowerPoint documents having at least one embedded component. Furthermore, the data show a high diversity of component types, with Word documents having the highest diversity.

Table 6 shows the popularity and average size of component types for Word, PowerPoint, and Excel documents. For all three applications, image components are either the first or second most popular type. Additionally, the average size of image components is among the largest of all types. This evidence further suggests that efforts toward reducing file size should focus on image components.

Figures 12 and 13 show the average number of embedded components and the average size of components for Word documents. As with images, both plots show an increase in the number and size of components as documents get bigger. These results are consistent with the findings of section 4.2, where the contribution of embedded components to the document size grows significantly as document size increases. The results for PowerPoint and Excel are similar, so they are excluded for compactness.

 
Table 5: Embedded component statistics. The table shows the percentage of documents that have at least one embedded component, the number of different component types, the average number of components in a document, and the average, standard deviation, and median of the sizes of embedded components.
                               Application
Statistic                      Word      PowerPoint   Excel
% with components              18.19     46.38        1.42
number of component types      55        11           8
average number of components   6.71      9.18         9.05
average component size (KB)    37.62     18.51        26.01
stdev (KB)                     141.78    109.33       133.37
median (KB)                    1.63      2.31         16.81


 
Table 6: Average size and popularity of component types in Word, PowerPoint, and Excel documents.
Application   Component                 Avg. Size (KB)   % of Occurrences
Word          Equation                  0.74             51.12
              Word Picture              80.68            14.81
              Clip Art                  10.38            8.93
              Excel Sheet               153.46           8.53
              OLE Link                  23.80            3.90
              Paint Brush               315.75           1.71
              MS Draw                   7.28             1.35
              PowerPoint                41.74            0.96
              Other                     97.52            8.68
PowerPoint    Clipart Gallery           5.00             41.12
              Other Image Components    28.68            38.80
              Word Table                23.28            8.01
              Excel                     57.13            4.18
              Graph                     3.26             3.86
              Equation Editor           0.81             1.82
              Excel Chart               38.08            1.09
              Organization Chart        2.87             0.75
              Note-It OLE               32.11            0.19
              Wordart                   1.31             0.15
              Sound                     3.35             0.02
Excel         Paint Brush               17.91            0.42
              Word                      28.17            0.32
              Clipart Gallery           2.56             0.15
              Forms                     2.39             0.05
              Image                     4.99             0.05
              Word Picture              2066.83          0.00


  
Figure 12: Average number of embedded components in Word documents.


  
Figure 13: Average size of embedded components in Word documents.

4.6.4 Virtual components

Table 7 shows the average number of pages, slides, and sheets found in Word, PowerPoint and Excel documents.

Figure 14 shows that the average number of pages increases initially with the size of the document and then levels off. This implies that most documents have a similar length and that differences in size are due mainly to the level of sophistication of the document (i.e., whether it has more pictures and embedded components). This is consistent with the findings of section 4.2, where the size of the text element of Word and PowerPoint documents remained almost constant over large variations in document size.

 
Table 7: Virtual components. The table shows statistics for pages in Word, slides in PowerPoint, and sheets in Excel documents.
Statistic   Word Pages   PowerPoint Slides   Excel Sheets
average     11.95        20.59               5.22
stdev       27.76        17.48               6.49
median      4            17                  2


  
Figure 14: Average number of pages in Word documents.


de Lara, Wallach, and Zwaenepoel
November, 1999