ToXgene - the ToX XML Data Generator

ToXgene V 2.3 was released in February 2005. This new version of ToXgene offers a simple API that allows ToXgene to be invoked from any Java application. A sample front-end Java class is provided with the samples. See ToXgene's architecture for more; Javadoc documentation is also available.

ToXgene is a template-based generator for large, consistent collections of synthetic XML documents, developed as part of the ToX (the Toronto XML Server) project. ToX is a heterogeneous repository for XML data and metadata being developed at the Database Group of the University of Toronto.

ToXgene was designed to be declarative, and thus speed up the data generation cycle; general enough to produce fairly complex XML content; and powerful enough to capture the most common kinds of integrity constraints in popular benchmarks.

The ToXgene Template Specification Language (TSL) is a subset of the XML Schema notation augmented with annotations for specifying certain properties of the intended data, such as value distributions, the vocabulary for CDATA content, etc. Being template-based, our tool gives its users total control over the structure and content of the XML documents it produces.

These are the main features of our tool:

Generation of complex XML content: our tool supports all XML element content models (CDATA, element and mixed) and allows the generation of attributes as well. CDATA values are generated according to a type declaration; various string, numeric and date types are supported.
Use of skewed distributions: the user can specify skewed distributions to determine the number of occurrences for elements, as well as to control the generation of CDATA literals (e.g., the length of string values). ToXgene supports the uniform, exponential, normal, log-normal, geometric, and user-defined multinomial distributions.
Element sharing: our tool allows different elements (or attributes) to share CDATA literals, thus allowing the generation of references among elements in the same (or in different) documents. This enables the generation of collections of correlated documents (i.e., documents that can be joined by value).
Integrity constraints: element sharing in ToXgene is achieved by generating the shared content prior to the actual documents, and storing this data in memory in lists (tox-lists). Our tool allows the specification of the most common integrity constraints (e.g., uniqueness) over the data in such lists; thus, one can generate consistent ID, IDREF and IDREFS attributes. One can also specify integrity constraints over elements (or attributes) in different documents, which allows the generation of consistent single or multi-document data sets.
Modularity: ToXgene is implemented in a modular way that decouples the interface from the data generation engine (see ToXgene's architecture). Moreover, access to the data generation engine is done through an API, which means that ToXgene can be used as part of any Java application.
Reuse of existing data: our tool allows the user to load existing data into tox-lists; such data is treated as any other shared data in the generation process. This allows the mixing of real and synthetic data, often required in common benchmarks (e.g., names of countries) and also the growing of existing collections of documents without having to start from scratch again.
Extensibility: ToXgene was developed in Java 2, and has a very simple interface for plugging in new CDATA generators. For convenience, we provide the source code for the various CDATA generators that already ship with ToXgene.
Scalability: If necessary, ToXgene can use a Persistent Object Manager for storing temporary data structures that do not fit in main memory. One can customize the buffer management and take advantage of parallel I/O for optimal performance.
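
To make the interplay of element sharing, integrity constraints, and skewed distributions more concrete, here is a minimal Java sketch of the general idea. It does not use ToXgene's API or the TSL notation; the class SharedListSketch, its methods, and the customer/order element names are hypothetical and exist only for illustration. A pool of unique values is built before any document is produced, one document declares those values as identifiers, and a second document references them, with the number of child elements per order drawn from a geometric distribution.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Illustrative only: this is NOT ToXgene's API, just a sketch of the idea.
final class SharedListSketch {

    /** Builds a pool of distinct identifiers before any document is generated. */
    static List<String> buildUniquePool(int size, Random rng) {
        Set<String> pool = new LinkedHashSet<>();
        while (pool.size() < size) {
            // The set silently drops duplicates, which enforces uniqueness.
            pool.add("c" + rng.nextInt(size * 10));
        }
        return new ArrayList<>(pool);
    }

    /** Samples a geometric count (>= 1): trials up to and including the first success. */
    static int geometric(double p, Random rng) {
        if (p <= 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in (0, 1]");
        }
        int n = 1;
        while (rng.nextDouble() >= p) {
            n++;
        }
        return n;
    }

    /** First document: declares every pooled value exactly once (ID-like). */
    static String customers(List<String> pool) {
        StringBuilder sb = new StringBuilder("<customers>\n");
        for (String id : pool) {
            sb.append("  <customer id=\"").append(id).append("\"/>\n");
        }
        return sb.append("</customers>\n").toString();
    }

    /** Second document: each order references a pooled value (IDREF-like). */
    static String orders(List<String> pool, int count, double p, Random rng) {
        StringBuilder sb = new StringBuilder("<orders>\n");
        for (int i = 0; i < count; i++) {
            String ref = pool.get(rng.nextInt(pool.size()));
            sb.append("  <order customer=\"").append(ref).append("\">\n");
            int items = geometric(p, rng); // skewed number of occurrences
            for (int j = 0; j < items; j++) {
                sb.append("    <item/>\n");
            }
            sb.append("  </order>\n");
        }
        return sb.append("</orders>\n").toString();
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed for repeatable output
        List<String> pool = buildUniquePool(5, rng);
        System.out.print(customers(pool));
        System.out.print(orders(pool, 3, 0.5, rng));
    }
}
```

Because both documents draw from the same pre-generated pool, every customer attribute in the orders document matches some id in the customers document, so the two can be joined by value. In ToXgene itself this behaviour is declared in the template rather than hand-coded.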