To its user, a compound document appears to be a single unit of information, but in fact it can contain elements created by different applications. A compound document could, for instance, consist of a spreadsheet and several images embedded into a text document.
In the general case, every data type in a compound document (spreadsheet, text, images, sound, etc.) is created and managed by a different application. The applications used to create the document can be thought of as software components whose services are invoked to create, edit, and display the compound document.
Compound documents result from combining the data created by two or more disjoint software components. As a result, there is a need for a standard to govern the interactions between components. Some of the most visible standards are COM/OLE [3,4], SOM/OpenDoc [14], and JavaBeans [7]. Among other things, such standards define how components are uniquely identified, how they are stored on disk, and how they interact with one another and system resources such as the screen.
Typical container applications keep two versions of each embedded component in persistent storage. The first consists of the embedded component's native data, which is used to initialize the component. This data is created and managed by the component itself. The second is a cached image of the state of the component the last time it was instantiated. This image, although created by the component, is managed by the container application. It serves two purposes. First, it allows the document to be rendered quickly, since the code that understands the component's specific type need not be executed until the user wishes to modify the component. Second, the cached image allows the document to be rendered even on systems where some components are not available.
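The two persisted versions can be illustrated with a minimal sketch. The names `EmbeddedEntry`, `render`, `activate`, and the component type `Hypothetical.Chart` are assumptions made for illustration; they do not belong to any container application or component standard.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class EmbeddedEntry:
    """What the container persists for one embedded component (hypothetical)."""
    component_type: str   # identifies the code needed to edit the data
    native_data: bytes    # created and managed by the component; opaque here
    cached_image: bytes   # snapshot of the component's last rendered state


def render(entry: EmbeddedEntry) -> bytes:
    """Display path: uses only the cached image, so no component code runs."""
    return entry.cached_image


def activate(entry: EmbeddedEntry,
             registry: Dict[str, Callable[[bytes], object]]) -> Optional[object]:
    """Edit path: instantiate the component from its native data, if installed."""
    factory = registry.get(entry.component_type)
    if factory is None:
        return None  # component unavailable; the document can still be rendered
    return factory(entry.native_data)


# Illustrative entry; the byte strings are placeholders, not real formats.
entry = EmbeddedEntry("Hypothetical.Chart", b"3,1,4", b"<bitmap bytes>")
```

In this sketch, `render(entry)` succeeds even when `registry` is empty, which corresponds to the case in which the component is not installed on the system.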
The Component Object Model (COM) enables software components to export well-defined interfaces and interact with one another. In COM, software components implement their services as one or more COM objects. Every object implements one or more interfaces, each of which exports a number of methods. COM components communicate by invoking these methods.
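The relationship between objects, interfaces, and methods can be sketched with abstract base classes. This is an analogy rather than the COM API: the interfaces `Renderable` and `Persistable`, the class `ChartObject`, and the helper `query_interface` are hypothetical names, the last one loosely modeled on the role that interface lookup plays in COM.

```python
from abc import ABC, abstractmethod


class Renderable(ABC):
    """Hypothetical interface exporting a single drawing method."""
    @abstractmethod
    def draw(self) -> str: ...


class Persistable(ABC):
    """Hypothetical interface exporting a single serialization method."""
    @abstractmethod
    def save(self) -> bytes: ...


class ChartObject(Renderable, Persistable):
    """One object implementing two interfaces."""
    def __init__(self, values):
        self.values = list(values)

    def draw(self) -> str:
        return "chart of " + ", ".join(str(v) for v in self.values)

    def save(self) -> bytes:
        return ",".join(str(v) for v in self.values).encode("ascii")


def query_interface(obj, interface):
    """Return obj if it implements the requested interface, else None."""
    return obj if isinstance(obj, interface) else None


chart = ChartObject([1, 2, 3])
drawable = query_interface(chart, Renderable)
```

A client that holds `drawable` relies only on the methods that `Renderable` exports, which is the sense in which components communicate through interfaces rather than through knowledge of each other's implementation.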
The Object Linking and Embedding (OLE) specification is a set of standard COM interfaces that enable users to create compound documents by linking and embedding objects (components) into container applications, hence the name OLE. OLE also includes other component-based technologies, such as ActiveX and OLE Automation.
The OLE Structured Storage Interface (SSI) provides the means for multiple components to share a single file. This is necessary because, in the user's perception, a compound document is a single unit of information and should appear as a single file.
The SSI implements an abstraction similar to a file system within a single file. It supports two types of objects: storages and streams. Storages are analogous to directories and contain streams or further storages. Streams are analogous to files and contain the components' data.
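The storage and stream abstraction can be modeled as a small tree. The sketch below is an in-memory toy, not the SSI itself; the class names and the names `report.doc`, `MainText`, `EmbeddedChart`, and `NativeData` are assumptions chosen for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Analogous to a file: holds raw bytes."""
    name: str
    data: bytes = b""


@dataclass
class Storage:
    """Analogous to a directory: holds streams and nested storages."""
    name: str
    streams: dict[str, Stream] = field(default_factory=dict)
    storages: dict[str, Storage] = field(default_factory=dict)

    def create_stream(self, name: str, data: bytes = b"") -> Stream:
        stream = Stream(name, data)
        self.streams[name] = stream
        return stream

    def create_storage(self, name: str) -> Storage:
        child = Storage(name)
        self.storages[name] = child
        return child

    def walk(self, prefix: str = ""):
        """Yield the path of every stream below this storage, depth first."""
        path = f"{prefix}/{self.name}"
        for stream_name in self.streams:
            yield f"{path}/{stream_name}"
        for child in self.storages.values():
            yield from child.walk(path)


# A document with one top-level stream and one nested storage.
root = Storage("report.doc")
root.create_stream("MainText", b"placeholder text")
chart = root.create_storage("EmbeddedChart")
chart.create_stream("NativeData", b"placeholder chart data")
paths = list(root.walk())
```

Here `paths` lists `/report.doc/MainText` followed by `/report.doc/EmbeddedChart/NativeData`, mirroring how a directory listing would expose files in nested directories.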
In compound documents, each embedded component is stored in a separate storage. When an embedded component is instantiated, its embedding container supplies it with a pointer to the storage that holds the component's native data. The embedded component uses these data to initialize its state. An embedded component manages its own storage; the parent container need not understand the information stored within it.
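This division of responsibility can be sketched as follows, with plain dictionaries standing in for storages. `ChartComponent`, `Container`, and the stream name `Values` are hypothetical; the point is only that the container forwards the storage without interpreting its contents.

```python
class ChartComponent:
    """Hypothetical embedded component that owns the layout of its storage."""

    def __init__(self):
        self.values = []

    def load(self, storage: dict) -> None:
        # Only the component knows that its data lives in a "Values" stream.
        text = storage["Values"].decode("ascii")
        self.values = [int(v) for v in text.split(",")]

    def save(self, storage: dict) -> None:
        storage["Values"] = ",".join(map(str, self.values)).encode("ascii")


class Container:
    """Hypothetical container: maps storage names to opaque sub-storages."""

    def __init__(self, document: dict):
        self.document = document

    def instantiate(self, storage_name: str, component_cls):
        component = component_cls()
        # The sub-storage is handed over without being inspected.
        component.load(self.document[storage_name])
        return component


document = {"Chart1": {"Values": b"3,1,4"}}
component = Container(document).instantiate("Chart1", ChartComponent)
component.values.append(1)
component.save(document["Chart1"])
```

After the call to `save`, the `Values` stream in `document["Chart1"]` reflects the edited state, although `Container` never parsed it.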
Conceptually, MS-Office documents may have up to three classes of components: images, OLE-based embedded components, and virtual components. Images are graphic data that are stored and manipulated directly by the application. This class includes the cached versions of any embedded components and any graphic data that the application chooses to manipulate directly. OLE-based embedded components are data created using a separate application, as described above. Finally, virtual components are objects that are not implemented as OLE-based components but that are perceived by the user as separate entities (e.g., pages in Word, slides in PowerPoint, and sheets in Excel).
Microsoft Office 2000 supports two native file formats: the traditional OLE-based binary format (hereafter, ``OLE archive'') and a new XML-based format. The OLE archives [11,10,8,9] rely on the OLE SSI to provide a unified view of the compound document in a single file. However, the manner in which MS-Office applications use the OLE SSI to store embedded objects varies. Word and Excel, for example, store every embedded object in a separate storage, making the component structure of the document visible to the OLE SSI. In contrast, PowerPoint compresses the native data of embedded objects and stores it in the main application stream. While this strategy increases document compression, it limits the ability of third-party applications to manipulate components within a PowerPoint document.
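The consequence for third-party tools can be illustrated with a toy comparison of the two layouts. Nested dictionaries stand in for storages, `zlib` stands in for whatever compression PowerPoint applies, and all stream and storage names are assumptions; none of this reflects the actual on-disk formats.

```python
import zlib


def visible_components(document: dict) -> list:
    """Names of the sub-storages that a generic storage reader can enumerate."""
    return sorted(name for name, value in document.items()
                  if isinstance(value, dict))


chart_data = b"3,1,4"

# Word/Excel style: each embedded object occupies its own storage.
per_storage_layout = {
    "MainStream": b"document text",
    "Chart1": {"NativeData": chart_data},
}

# PowerPoint style: the object's data is compressed into the main stream
# (zlib is a stand-in; the real encoding is not modeled here).
single_stream_layout = {
    "MainStream": b"slide data" + zlib.compress(chart_data),
}

word_like = visible_components(per_storage_layout)
powerpoint_like = visible_components(single_stream_layout)
```

In the first layout the generic reader finds `Chart1`; in the second it finds nothing, because locating the component requires knowledge of the application's private stream format.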
The new XML format [12] provides a more browser-friendly option for storing MS-Office documents. Where an OLE archive appears as a single file, an XML document appears as an entire directory of XML files, approximately one per component, image, or slide. The current implementation of MS-Office supports two versions of XML output: a compact, low-fidelity representation that can be read by browsers but cannot be edited by MS-Office tools, and a larger, high-quality representation that supports editing. In this study we focus on the latter XML representation.
Aside from the number of files that they use, the two file formats differ mostly in their representation of text and formatting information. Images and embedded component native data have similar representations in both formats, with the caveat that component data in the XML-based format is stored in a compressed OLE archive.