ToXgene V 2.3 released in February 2005!
This new version of ToXgene offers a simple API that allows it to be invoked from any
Java application. A sample front-end Java class is included with the samples.
See ToXgene's architecture for more details; Javadoc documentation is also
available.
ToXgene is a template-based generator for large, consistent collections of
synthetic XML documents, developed as part of the ToX (the Toronto XML Server) project. ToX is a
heterogeneous repository for XML data and metadata being developed at the
Database Group of the University of Toronto.
ToXgene was designed to be declarative, and thus speed up the data
generation cycle; general enough to produce fairly complex XML content;
and powerful enough to capture the most common kinds of integrity
constraints in popular benchmarks.
The ToXgene Template Specification Language (TSL) is a subset of the XML
Schema notation augmented with annotations for specifying certain properties
of the intended data, such as value distributions, the vocabulary for CDATA
content, etc. Being template-based, our tool gives its users total
control over the structure and content of the XML documents it
produces.
These are the main features of our tool:
- Generation of complex XML content: our tool supports all XML
element content models (CDATA, element and mixed) and allows the
generation of attributes as well. CDATA values are generated
according to a type declaration; various string, numeric and date types
are supported.
- Use of skewed distributions: the user can specify skewed
distributions to determine the number of occurrences for
elements, as well as to control the generation of CDATA literals
(e.g., the length of string values). ToXgene supports the uniform,
exponential, normal, log-normal, geometric, and user-defined multinomial
distributions.
- Element sharing: our tool allows different elements (or attributes)
to share CDATA literals, thus allowing the generation of
references among elements in the same (or in different) documents. This
enables the generation of collections of correlated documents (i.e.,
documents that can be joined by value).
- Integrity constraints: element sharing in ToXgene is achieved by
generating the shared content prior to the actual documents, and storing
this data in memory in lists (tox-lists). Our tool allows the specification
of the most common integrity constraints (e.g., uniqueness) over the data in
these lists; thus, one can generate consistent ID, IDREF
and IDREFS attributes. One can also specify integrity
constraints over elements (or attributes) in different documents, which
allows the generation of consistent single or multi-document data
sets.
- Modularity: ToXgene is implemented in a modular way that decouples
the interface from the data generation engine (see
ToXgene's architecture). Moreover,
access to the data generation engine is done through an API, which means
that ToXgene can be used as part of any Java application.
- Reuse of existing data: our tool allows the user to load
existing data into tox-lists; such data is treated as any other shared
data in the generation process. This allows the mixing of real and
synthetic data, often required in common benchmarks (e.g., names of
countries) and also the growing of existing collections of documents
without having to start from scratch again.
- Extensibility: ToXgene was developed in Java 2, and has a very
simple interface for plugging in new CDATA generators. For convenience, we
provide the source code for the various CDATA generators that already ship
with ToXgene.
- Scalability: If necessary, ToXgene can use a Persistent Object
Manager for storing temporary data structures that do not fit in main
memory. One can customize the buffer management and take advantage of
parallel I/O for optimal performance.
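
To make the element sharing and integrity constraint features above more concrete, here is a minimal Java sketch of the general idea: shared values are generated first, under a uniqueness constraint, and the documents that reference them are produced afterwards. It does not use ToXgene's API or TSL. The class name, the method names, and the element and attribute names (customers, orders, item) are all hypothetical and chosen only for illustration. The number of items per order is drawn from a capped geometric distribution, one of the distributions listed above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Illustrative sketch only: none of these names come from ToXgene's API.
class SharedListSketch {

    // Step 1: generate the shared values up front, enforcing uniqueness.
    static List<String> uniqueIds(int count, Random rng) {
        Set<String> ids = new LinkedHashSet<>();
        while (ids.size() < count) {
            // A duplicate value is silently rejected by the set, so we retry.
            ids.add("c" + rng.nextInt(10 * count));
        }
        return new ArrayList<>(ids);
    }

    // Occurrence count from a geometric distribution with success
    // probability p (support 1, 2, ...), capped at max.
    static int geometric(double p, int max, Random rng) {
        int n = 1;
        while (n < max && rng.nextDouble() >= p) {
            n++;
        }
        return n;
    }

    // A document that declares each shared value exactly once.
    static String customersDocument(List<String> customerIds) {
        StringBuilder xml = new StringBuilder("<customers>\n");
        for (String id : customerIds) {
            xml.append("  <customer id=\"").append(id).append("\"/>\n");
        }
        return xml.append("</customers>\n").toString();
    }

    // Step 2: a second document whose references are sampled from the
    // shared list, so the two documents can be joined by value.
    static String ordersDocument(List<String> customerIds, int orders, Random rng) {
        StringBuilder xml = new StringBuilder("<orders>\n");
        for (int i = 0; i < orders; i++) {
            String ref = customerIds.get(rng.nextInt(customerIds.size()));
            int items = geometric(0.5, 5, rng); // skewed number of occurrences
            xml.append("  <order customer=\"").append(ref).append("\">");
            for (int j = 0; j < items; j++) {
                xml.append("<item/>");
            }
            xml.append("</order>\n");
        }
        return xml.append("</orders>\n").toString();
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        List<String> ids = uniqueIds(3, rng);
        System.out.print(customersDocument(ids));
        System.out.print(ordersDocument(ids, 4, rng));
    }
}
```

Because the shared list is complete before either document is written, every customer attribute in the orders document is guaranteed to match an id in the customers document. In ToXgene itself this behaviour is specified declaratively in a template rather than coded by hand.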