30th International Conference on
Very Large Data Bases
 
Royal York Hotel
29 August - 3 September 2004
Toronto, Canada

 

 

ABSTRACTS

 


RESEARCH SESSION 1:COMPRESSION AND INDEXING

Compressing Large Boolean Matrices using Reordering Techniques
David S. Johnson, Shankar Krishnan (AT&T Labs-Research), Jatin Chhugani, Subodh Kumar (John Hopkins Univ.), Suresh Venkatasubramanian (AT&T Labs-Research)

Large boolean matrices are a basic representational unit in a variety of applications, with some notable examples being interactive visualization systems, mining large graph structures, and association rule mining. Designing space and time efficient scalable storage and query mechanisms for such large matrices is a challenging problem. We present a lossless compression strategy to store and access such large matrices efficiently on disk. Our approach is based on viewing the columns of the matrix as points in a very high dimensional Hamming space, and then formulating an appropriate optimization problem that reduces to solving an instance of the Traveling Salesman Problem on this space. Finding good solutions to large TSP's in high dimensional Hamming spaces is itself a challenging and little-explored problem -- we cannot readily exploit geometry to avoid the need to examine all N^2 inter-city distances and instances can be too large for standard TSP codes to run in main memory. Our multi-faceted approach adapts classical TSP heuristics by means of instance-partitioning and sampling, and may be of independent interest. For instances derived from interactive visualization and telephone call data we obtain significant improvement in access time over standard techniques, and for the visualization application we also make significant improvements in compression.

On the Performance of Bitmap Indices for High Cardinality Attributes
Kesheng Wu, Ekow Otoo, Arie Shoshani (Lawrence Berkeley National Laboratory)

It is well established that bitmap indices are efficient for read-only attributes with low attribute cardinalities. For an attribute with a high cardinality, the size of the bitmap index can be very large. To overcome this size problem, specialized compression schemes are used. Even though there are empirical evidences that some of these compression schemes work well, there has not been any systematic analysis of their effectiveness. In this paper, we systematically analyze the two most efficient bitmap compression techniques, the Byte-aligned Bitmap Code(BBC) and the Word-Aligned Hybrid (WAH) code. Our analyses show that both compression schemes can be optimal. We propose a novel strategy to select the appropriate algorithms so that this optimality is achieved in practice. In addition, our analyses and tests show that the compressed indices are relatively small compared with commonly used indices such as B-trees. Given these facts, we conclude that bitmap index is efficient on attributes of low cardinalities as well as on those of high cardinalities.


Practical Suffix Tree Construction
Sandeep Tata, Richard A. Hankins, Jignesh M. Patel (Univ. of Michigan)

Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix tree scan be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n^2) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we present a buffer management strategy for the O(n^2) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature. Our approach far outperforms the best known disk-based construction algorithms.


RESEARCH SESSION 2:XML VIEWS AND SCHEMAS

Answering XPath Queries over Networks by Sending Minimal Views
Keishi Tajima, Yoshiki Fukui (JAIST)

When a client submits a set of XPath queries to a XML database on a network, the set of answer sets sent back by the database may include redundancy in two ways: some elements may appear in more than one answer set, and some elements in some answer sets may be subelements of other elements in other (or the same) answer sets. Even when a client submits a single query, the answer can be self-redundant because some elements may be subelements of other elements in that answer. Therefore, sending those answers as they are is not optimal with respect to communication costs. In this paper, we propose a method of minimizing communication costs in XPath processing over networks. Given a single or a set of queries, we compute a minimal-size view set that can answer all the original queries. The database sends this view set to the client, and the client produces answers from it. We show algorithms for computing such a minimal view set for given queries. This view set is optimal; it only includes elements that appear in some of the final answers, and each element appears only once.

A Framework for Using Materialized XPath Views in XML Query Processing
Andrey Balmin, Fatma Ozcan, Kevin Beyer, Roberta Cochrane, Hamid Pirahesh (IBM Almaden)

XML languages, such as XQuery, XSLT and SQL/XML, employ XPath as the search and extraction language. XPath expressions often define complicated navigation, resulting in expensive query processing, especially when executed over large collections of documents. In this paper, we propose a framework for exploiting materialized XPath views to expedite processing of XML queries. We explore a class of materialized XPath views, which may contain XML fragments, typed data values, full paths, node references or any combination thereof. We develop an XPath matching algorithm to determine when such views can be used to answer a user query containing XPath expressions. We use the match information to identify the portion of an XPath expression in the user query which is not covered by the XPath view. Finally, we construct, possibly multiple, compensation expressions which need to be applied to the view to produce the query result. Experimental evaluation, using our prototype implementation, shows that the matching algorithm is very efficient and usually accounts for a small fraction of the total query compilation time.

Schema-Free XQuery
Yunyao Li, Cong Yu, H. V. Jagadish (Univ. of Michigan)

The widespread adoption of XML holds out the promise that document structure can be exploited to specify precise database queries. However, the user may have only a limited knowledge of the XML structure, and hence may be unable to produce a correct XQuery, especially in the context of a heterogeneous information collection. The default is to use keyword-based search and we are all too familiar with how difficult it is to obtain precise answers by these means. We seek to address these problems by introducing the notion of Meaningful Lowest Common Ancestor Structure (MLCAS) for finding related nodes within an XML document. By automatically computing MLCAS and expanding ambiguous tag names, we add new functionality to XQuery and enable users to take full advantage of XQuery in querying XML data precisely and efficiently without requiring (perfect) knowledge of the document structure. Such a Schema-Free XQuery is potentially of value not just to casual users with partial knowledge of schema, but also to experts working in a data integration or data evolution context. In such a context, a schema-free query, once written, can be applied universally to multiple data sources that supply similar content under different schemas, and applied "forever" as these schemas evolve. Our experimental evaluation found that it was possible to express a wide variety of queries in a schema-free manner and have them return correct results over a broad diversity of schemas. Furthermore, the evaluation of a schema-free query is not expensive using a novel stack-based algorithm we develop for computing MLCAS: from 1 to 4 times the execution time of an equivalent schema-aware query.


RESEARCH SESSION 3: CONTROLLING ACCESS

Client-Based Access Control Management for XML Documents
Luc Bouganim (INRIA Rocquencourt), Francois Dang Ngoc, Philippe Pucheral (PRiSM Laboratory)

The erosion of trust put in traditional database servers and in Database Service Providers, the growing interest for different forms of data dissemination and the concern for protecting children from suspicious Internet content are different factors that lead to move the access control from servers to clients. Several encryption schemes can be used to serve this purpose but all suffer from a static way of sharing data. With the emergence of hardware and software security elements on client devices, more dynamic client-based access control schemes can be devised. This paper proposes an efficient client-based evaluator of access control rules for regulating access to XML documents. This evaluator takes benefit from a dedicated index to quickly converge towards the authorized parts of a - potentially streaming - document. Additional security mechanisms guarantee that prohibited data can never be disclosed during the processing and that the input document is protected from any form of tampering. Experiments on synthetic and real datasets demonstrate the effectiveness of the approach.

Secure XML Publishing without Information Leakage in the Presence of Data Inference
Xiaochun Yang (Northeastern Univ.), Chen Li (Univ. of California, Irvine)

Recent applications are seeing an increasing need that publishing XML documents should meet precise security requirements. In this paper, we consider data-publishing applications where the publisher specifies what information is sensitive and should be protected. We show that if a partial document is published carelessly, users can use common knowledge (e.g., ``all patients in the same ward have the same disease'') to infer more data, which can cause leakage of sensitive information. The goal is to protect such information in the presence of data inference with common knowledge. We consider common knowledge represented as semantic XML constraints. We formulate the process how users can infer data using three types of common XML constraints. Interestingly, no matter what sequences users follow to infer data, there is a unique, maximal document that contains all possible inferred documents. We develop algorithms for finding a partial document of a given XML document without causing information leakage, while allowing publishing as much data as possible. Our experiments on real data sets show that effect of inference on data security, and how the proposed techniques can prevent such leakage from happening.

Limiting Disclosure in Hippocratic Databases
Kristen LeFevre (Univ. of Wisconsin), Rakesh Agrawal (IBM Almaden), Vuk Ercegovac, Raghu Ramakrishnan (Univ. of Wisconsin),Yirong Xu (IBM Almaden), David DeWitt (Univ. of Wisconsin)

We present a practical and efficient approach to incorporating privacy policy enforcement into an existing application and database environment, and we explore some of the semantics tradeoffs introduced by enforcing these rules at cell-level granularity. Through a comprehensive set of performance experiments, we show that the cost of privacy enforcement is small, and scalable to large databases.


RESEARCH SESSION 4: XML (I)

On Testing Satisfiability of Tree Pattern Queries
Laks Lakshmanan, Ganesh Ramesh, Hui Wang, Zheng Zhao (Univ. of British Columbia)

XPath and XQuery (which includes XPath as a sublanguage) are the major query languages for XML. An important issue arising in efficient evaluation of queries expressed in these languages is satisfiability, i.e., whether there exists a database, consistent with the schema if one is available, on which the query has a non-empty answer. Our experience shows satisfiability check can effect substantial savings in query evaluation. We systematically study satisfiability of tree pattern queries (which capture a useful fragment of XPath) together with additional constraints, with or without a schema. We identify cases in which this problem can be solved in polynomial time and develop novel efficient algorithms for this purpose. We also show that in several cases, the problem is NP-complete. We ran a comprehensive set of experiments to verify the utility of satisfiability check as a preprocessing step in query processing. Our results show that this check takes a negligible fraction of the time needed for processing the query while often yielding substantial savings.

Containment of Nested XML Queries
Xin Dong, Alon Halevy, Igor Tatarinov (Univ. of Washington)

Query containment is the most fundamental relationship between a pair of database queries: a query Q is said to be contained in a query Q' if the answer for Q is always a subset of the answer for Q',independent of the current state of the database. Query containment is an important problem in a wide variety of data management applications, including verification of integrity constraints, reasoning about contents of data sources in data integration, semantic caching, verification of knowledge bases, determining queries independent of updates, and most recently, in query reformulation for peer data management systems. Query containment has been studied extensively in the relational context and for XPath queries, but not for XML queries with nesting. We consider the theoretical aspects of the problem of query containment for XML queries with nesting. We begin by considering conjunctive XML queries (c-XQueries), and show that containment is in polynomial time if we restrict the fan-out (number of sibling sub-blocks) to be 1. We prove that for arbitrary fan-out, containment is coNP-hard already for queries with nesting depth 2, even if the query does not include variables in the return clauses. We then show that for queries with fixed nesting depth, containment is coNP-complete. Next, we establish the computational complexity of query containment for several practical extensions of c-XQueries, including queries with union and arithmetic comparisons, and queries where the XPath expressions may include descendant edges and negation. Finally, we describe a few heuristics for speeding up query containment checking in practice by exploiting properties of the queries and the underlying schema.

Efficient XML-to-SQL Query Translation: Where to Add the Intelligence ?
Rajasekar Krishnamurthy (IBM Almaden), Raghav Kaushik (Microsoft Research), Jeffrey Naughton (Univ. of Wisconsin-Madison)

We consider the efficiency of queries generated by XML to SQL translation. We first show that published XML-to-SQL query translation algorithms are suboptimal in that they often translate simple path expressions into complex SQL queries even when much simpler equivalent SQL queries exist. There are two logical ways to deal with this problem. One could generate suboptimal SQL queries using a fairly naive translation algorithm, and then attempt to optimize the resulting SQL; or one could use a more intelligent translation algorithm with the hopes of generating efficient SQL directly. We show that optimizing the SQL after it is generated is problematic, becoming intractable even in simple scenarios; by contrast, designing a translation algorithm that exploits information readily available at translation time is a promising alternative. To support this claim, we present a translation algorithm that exploits translation time information to generate efficient SQL for path expression queries over tree schemas.

Taming XPath Queries by Minimizing Wildcard Steps
Chee-Yong Chan (National Univ. of Singapore), Wenfei Fan (Univ. of Edinburgh & Bell Laboratories), Yiming Zeng (National Univ. of Singapore)

This paper presents a novel and complementary technique to optimize an XPath query by minimizing its wildcard steps. Our approach is based on using a general composite axis called the layer axis, to rewrite a sequence of XPath steps (all of which are wildcard steps except for possibly the last)into a single layer-axis step. We describe an efficient implementation of the layer axis and present a novel and efficient rewriting algorithm to minimize both non-branching as well as branching wildcard steps in XPath queries. We also demonstrate the usefulness of wildcard-step elimination by proposing an optimized evaluation strategy for wildcard-free Path queries that enables selective loading of only the relevant input XML data for query evaluation. Our experimental results not only validate the scalability and efficiency of our optimized evaluation strategy, but also demonstrate the effectiveness of our rewriting algorithm for minimizing wildcard steps in XPath queries. To the best of our knowledge, this is the first effort that addresses this new optimization problem.

The NEXT Framework for Logical Xquery Optimization
Alin Deutsch, Yannis Papakonstantinou, Yu Xu (Univ. of California, San Diego)

Classical logical optimization techniquesrely on a logical semantics of the query language. The adaptation of these techniques to XQuery is precluded by its definition as a functional language with operational semantics. We introduce Nested XML Tableaux which enable a logical foundation for XQuery semantics and provide the logical plan optimization framework of our XQuery processor. As a proof of concept, we develop and evaluate a minimization algorithm for removing redundant navigation within and across nested subqueries. The rich XQuery features create key challenges that fundamentally extend the prior work on the problems of minimizing conjunctive and tree pattern queries.


RESEARCH SESSION 5:STREAM MINING

Detecting Change in Data Streams
Daniel Kifer, Shai Ben-David, Johannes Gehrke (Cornell Univ.)

Detecting changes in a data stream is an important area of research with many applications. In this paper, we present a novel method for the detection and estimation of change. In addition to providing statistical guarantees on the reliability of detected changes, our method also provides meaningful descriptions and quantification of these changes. Our approach assumes that the points in the stream are independently generated, but otherwise makes no assumptions on the nature of the generating distribution. Thus our techniques work for both continuous and discrete data. In an experimental study we demonstrate the power of our techniques.

Stochastic Consistency, and Scalable Pull-Based Caching for Erratic Data Stream Sources
Shanzhong Zhu, Chinya V. Ravishankar (Univ.of California, Riverside)

We introduce the notion of stochastic consistency, and propose a novel approach to achieving it for caches of highly erratic data. Erratic data sources, such as stock prices, sensor data, and network traffic statistics, are common and important in practice. However their erratic patterns of change can make caching hard. Stochastic consistency guarantees that errors in cached values of erratic data remain within a user-specified bound, with a user-specified probability. We use a Brownian motion model to capture the behavior of data changes, and use its underlying theory to predict when caches should initiate pulls to refresh cached copies to maintain stochastic consistency. Our approach allows servers to remain totally stateless, thus achieving excellent scalability and reliability. We also discuss a new real-time scheduling approach for servicing pull requests at the server. Our scheduler delivers prompt response whenever possible, and minimizes the aggregate cache-source deviation due to delays during server overload. We conduct extensive experiments to validate our model on real-life datasets, and show that our scheme outperforms current schemes.

False Positive or False Negative: Mining Frequent Item sets from High Speed Transactional Data Streams
Jeffrey Xu Yu (The Chinese Univ. of Hong Kong), Zhihong Chong (Fudan Univ.), Hongjun Lu (The Hong Kong Univ. of Science and Technology), Aoying Zhou (Fudan Univ.)

The problem of finding frequent items has been recently studied over high speed data streams. However, mining frequent item sets from transactional data streams has not been well addressed yet in terms of its bounds of memory consumption. The main difficulty is due to the nature of the exponential explosion of item sets. Given a domain of Iunique items, the possible number of item sets can be up to2**I-1. When the length of data streams approaches to a very large number N, the possibility of an item set to be frequent becomes larger and difficult to track with limited memory. However, the real killer of effective frequent item set mining is that most of existing algorithms are false-positive oriented. That is, they control memory consumption in the counting processes by an error parameter e, and allow items with support below the specified minimum support s but above s-e counted as frequent ones. Such false-positive items increase the number of false-positive frequent item sets exponentially, which may make the problem computationally intractable with bounded memory consumption. In this paper, we developed algorithms that can effectively mine frequent item(set)s from high speed transactional data streams with a bound of memory consumption. While our algorithms are false-negative oriented, that is, certain frequent item sets may not appear in the results, the number of false-negative item sets can be controlled by a predefined parameter so that desired recall rate of frequent item sets can be guaranteed. We developed algorithms based on Chernoff bound. Our extensive experimental studies show that the proposed algorithms have high accuracy, require less memory, and consume less CPU time. They significantly outperform the existing false-positive algorithms.


INDUSTRIAL SESSION 1: NOVEL SQL EXTENSIONS

Returning Modified Rows - SELECT Statements with Side Effects
Andreas Behm, Serge Rielau, Richard Swagerman (IBM Toronto Lab)

SQL in the IBM® DB2® Universal Database™ for Linux®, UNIX®, and Windows® (DB2 UDB) database management product has been extended to support nested INSERT, UPDATE, and DELETE operations in SELECT statements. This allows database applications additional processing on modified rows. Within a single unit of work, applications can retrieve a result set containing the modified rows from a table or view modified by an SQL data-change operation. This eliminates the need to select the row after an INSERT or UPDATE, or before a DELETE statement. As a result, fewer network round trips, less server CPU time, fewer cursors, and less server memory are required. In addition, deadlocks can be avoided. The proposed approach is integrated with the set semantics of SQL, and does not require any procedural logic or modifications on the underlying relational data model. Pipelining multiple update, insert and delete operations using the same source data provides a very efficient way for multi-table data-change statements typically found in ETL (extraction, transformation, load) applications. We demonstrate significant performance benefit with our experiences in the TPC-C benchmark. Experimental results show that the new SQL is more efficient in query execution compared to classic SQL.

PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS and Scalable Pull-Based Caching for Erratic Data Sources
Conor Cunningham, Cesar Galindo-Legaria, Goetz Graefe (Microsoft Corp.)

PIVOT and UNPIVOT, two operators on tabular data that exchange rows and columns, enable data transformations useful in data modeling, data analysis, and data presentation. They can quite easily be implemented inside a query processor, much like select, project, and join. Such a design provides opportunities for better performance, both during query optimization and query execution. We discuss query optimization and execution implications of this integrated design and evaluate the performance of this approach using a prototype implementation in Microsoft SQL Server.

A Multi-Purpose Implementation of Mandatory Access Control in Relational Database Management Systems
Walid Rjaibi, Paul Bird (IBM Toronto Lab)

Mandatory Access Control (MAC) implementations in Relational Database Management Systems (RDBMS) have focused solely on Multilevel Security (MLS). MLS has posed a number of challenging problems to the database research community, and there has been an abundance of research work to address those problems. Unfortunately, the use of MLS RDBMS has been restricted to a few government organizations where MLS is of paramount importance such as the intelligence community and the Department of Defense. The implication of this is that the investment of building an MLS RDBMS cannot be leveraged to serve the needs of application domains where there is a desire to control access to objects based on the label associated with that object and the label associated with the subject accessing that object, but where the label access rules and the label structure do not necessarily match the MLS two security rules and the MLS label structure. This paper introduces a flexible and generic implementation of MAC in RDBMS that can be used to address the requirements from a variety of application domains, as well as to allow an RDBMS to efficiently take part in an end-to-end MAC enterprise solution. The paper also discusses the extensions made to the SQL compiler component of an RDBMS to incorporate the label access rules in the access plan it generates for an SQL query, and to prevent unauthorized leakage of data that could occur as a result of traditional optimization techniques performed by SQL compilers.


RESEARCH SESSION 6:XML (II)

Indexing Temporal XML Documents
Alberto Mendelzon, Flavio Rizzolo (Univ. of Toronto), Alejandro Vaisman (Univ. of Buenos Aires)

Different models have been proposed recently for representing temporal data, tracking historical information, and recovering the state of the document as of any given time, in XML documents. We address the problem of indexing temporal XML documents. In particular we show that by indexing "continuous paths", i.e. paths that are valid continuously during a certain interval in a temporal XML graph, we can dramatically increase query performance. We describe in detail the indexing scheme, denoted TempIndex, and compare its performance against both a system based on a non-temporal path index, and one based on DOM.

Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams
Christoph Koch, Stefanie Scherzinger (Technische Universitat Wien), Nicole Schweikardt (Humboldt Universitat zu Berlin), Bernhard Stegmaier (Technische Universitat Munchen)

We introduce an extension of the XQuery language, FluX, that supports event-based query processing and the conscious handling of main memory buffers. Purely event-based queries of this language can be executed on streaming XML data in a very direct way. We then develop an algorithm that allows to efficiently rewrite XQueries into the event-based FluX language. This algorithm uses order constraints from a DTD to schedule event handlers and to thus minimize the amount of buffering required for evaluating a query. We discuss the various technical aspects of query optimization and query evaluation within our framework. This is complemented with an experimental evaluation of our approach.

Bloom Histogram: Path Selectivity Estimation for XML Data with Updates
Wei Wang (Univ. of NSW), Haifeng Jiang, Hongjun Lu (Hong Kong Univ. of Science and Technology), Jeffrey Xu Yu (The Chinese Univ. of Hong Kong)

Cost-based XML query optimization calls for accurate estimation of the selectivity of path expressions. Some other interactive and internet applications can also benefit from such estimations. While there are a number of estimation techniques proposed in the literature, almost none of them has any guarantee on the estimation accuracy within a given space limit. In addition, most of them assume that the XML data are more or less static, i.e., with few updates. In this paper, we present a framework for XML path selectivity estimation in a dynamic context. Specifically, we propose a novel data structure, bloom histogram, to approximate XML path frequency distribution within a small space budget and to estimate the path selectivity accurately with the bloom histogram. We obtain the upper bound of its estimation error and discuss the trade-offs between the accuracy and the space limit. To support updates of bloom histograms efficiently when underlying XML data change, a dynamic summary layer is used to keep exact or more detailed XML path information. We demonstrate through our extensive experiments that the new solution can achieve significantly higher accuracy with an even smaller space than the previous methods in both static and dynamic environments.


INDUSTRIAL SESSION 2: NEW DBMS ARCHITECTURES AND PERFORMANCE

Hardware Acceleration in Commercial Databases: A Case Study of Spatial Operations
Nag Nagender Bandi (Univ. of California, Santa Barbara), Chengyu Sun (California State Univ., Los Angeles), Divyakant Agrawal, Amr El Abbadi (Univ. of California, Santa Barbara)

Traditional databases have focused on the issue of reducing I/O cost as it is the bottleneck in many operations. As databases become increasingly accepted in areas such as Geographic Information Systems (GIS) and Bio-informatics, commercial DBMS need to support data types for complex data such as spatial geometries and protein structures. These non-conventional data types and their associated operations present new challenges. In particular, the computational cost of some spatial operations can be orders of magnitude higher than the I/O cost. In order to improve the performance of spatial query processing, innovative solutions for reducing this computational cost are beginning to emerge. Recently, it has been proposed that hardware acceleration of an off-the-shelf graphics card can be used to reduce the computational cost of spatial operations. However, this proposal is preliminary in that it establishes the feasibility of the hardware assisted approach in a stand-alone setting but not in a real-world commercial database. In this paper we present an architecture to show how hardware acceleration of an off-the-shelf graphics card can be integrated into a popular commercial database to speed up spatial queries. Extensive experimentation with real-world datasets shows that significant improvement in the performance of spatial operations can be achieved with this integration. The viability of this approach underscores the significance of a tighter integration of hardware acceleration into commercial databases for spatial applications.

P*TIME: Highly Scalable OLTP DBMS for Managing Update-Intensive Stream Workload
Sang K. Cha (Transact In Memory, Inc.), Changbin Song (Seoul National Univ.)

Over the past thirty years since the system R and Ingres projects started to lay the foundation for today's RDBMS implementations, the underlying hardware and software platforms have changed dramatically. However, the fundamental RDBMS architecture, especially, the storage engine architecture, largely remains unchanged. While this conventional architecture may suffices for satisfying most of today's applications, its deliverable performance range is far from meeting the so-called growing "real-time enterprise" demand of acquiring and querying high-volume update data streams cost-effectively. P*TIME is a new, memory-centric light-weight OLTP RDBMS designed and built from scratch to deliver orders of magnitude higher scalability on commodity SMP hardware than existing RDBMS implementations, not only in search but also in update performance. Its storage engine layer incorporates our previous innovations for exploiting engine-level micro-parallelism such as differential logging and optimistic latch-free index traversal concurrency control protocol. This paper presents the architecture and performance of P*TIME and reports our experience of deploying P*TIME as the stock market database server at one of the largest on-line brokerage firms.

Generating Thousand Benchmark Queries in Seconds
Meikel Poess (Oracle Corp.), John M. Stephens, Jr. (Gradient Systems)

The combination of an exponential growth in the amount of data managed by a typical business intelligence system and the increased competitiveness of a global economy has propelled decision support systems (DSS) from the role of exploratory tools employed by a few visionary companies to become a core requirement for a competitive enterprise. That same maturation has often resulted in a selection process that requires an ever more critical system evaluation and selection to be completed in an increasingly short period of time. While there have been some advances in the generation of data sets for system evaluation (see [3]), the quantification of query performance has often relied on models and methodologies that were developed for systems that were more simplistic, less dynamic, and less central to a successful business. In this paper we present QGEN, a flexible, high-level query generator optimized for decision support system evaluation. QGEN is able to generate arbitrary query sets, which conform to a selected statistical profile without requiring that the queries be statically defined or disclosed prior to testing. Its novel design links query syntax with abstracted data distributions, enabling users to parameterize their query workload to match an emerging access pattern or data set modification. This results in query sets that retain comparability for system comparisons while reflecting the inherent dynamism of operational systems, and which provide a broad range of syntactic and semantic coverage, while remaining focused on appropriate commonalities within a particular evaluation process or business segment.


RESEARCH SESSION 7:XML AND RELATIONS

XQuery on SQL Hosts
Torsten Grust, Sherif Sakr, Jens Teubner (Univ. of Konstanz)

Relational database systems may be turned into efficient XML and XPath processors if the system is provided with a suitable relational tree encoding. This paper extends this relational XML processing stack and shows that an RDBMS can also serve as a highly efficient XQuery runtime environment. Our approach is purely relational: XQuery expressions are compiled into SQL code which operates on the tree encoding. The core of the compilation procedure trades XQuery's notions of variable scopes and nested iteration (FLWOR blocks) for equi-joins. The resulting relational XQuery processor closely adheres to the language semantics, e.g., it obeys node identity as well as document and sequence order, and can support XQuery's full axis feature. The system exhibits quite promising performance figures in experiments. Somewhat unexpectedly, we will also see that the XQuery compiler can make good use of SQL's OLAP functionality.

ROX: Relational Over XML
Alan Halverson (Univ. of Wisconsin-Madison), Vanja Josifovski, Guy Lohman, Hamid Pirahesh (IBM Almaden), Mathias Moerschel

An increasing percentage of the data needed by business applications is being generated in XML format. Storing the XML in its native format will facilitate new applications that exchange business objects in XML format and query portions of XML documents using XQuery. This paper explores the feasibility of accessing natively-stored XML data through traditional SQL interfaces, called Relational Over XML (ROX), in order to avoid the costly conversion of legacy applications to XQuery. It describes the forces that are driving the industry to evolve toward the ROX scenario as well as some of the issues raised by ROX. The impact of denormalization of data in XML documents is discussed both from a semantic and performance perspective. We also weigh the implications of ROX for manageability and query optimization. We experimentally compared the performance of a prototype of the ROX scenario to today's SQL engines, and found that good performance can be achieved through a combination of utilizing XML's hierarchical storage to store relations "pre-joined" as well as creating indices over the remaining join columns. We have developed an experimental framework using DB2 8.1 for Linux, Unix and Windows, and have gathered initial performance results that validate this approach.

From XML View Updates to Relational View Updates: old solutions to a new problem
Vanessa Braganholo (Universidade Federal do Rio Grande do Sul), Susan Davidson (Univ. of Pennsylvania, and INRIA-FUTURS), Carlos Heuser (Universidade Federal do Rio Grande do Sul)

This paper addresses the question of updating relational databases through XML views. Using "query trees" to capture the notions of selection, projection, nesting, grouping, and heterogeneous sets found throughout most XML query languages, we show how XML views expressed using query trees can be mapped to a set of corresponding relational views. We then show how updates on the XML view are mapped to updates on the corresponding relational views. Existing work on updating relational views can then be leveraged to determine whether or not the relational views are updatable with respect to the relational updates, and if so, to translate the updates to the underlying relational database.


RESEARCH SESSION 8:STREAM MINING (II)

XWAVE: Optimal and Approximate Extended Wavelets for Streaming Data
Sudipto Guha (Univ. of Pennsylvania), Chulyun Kim, Kyuseok Shim (Seoul National Univ.)

Wavelet synopses have been found to be of interest in query optimization and approximate query answering. Recently, extended wavelets were proposed by Deligiannakis and Roussopoulos for datasets containing multiple measures. Extended wavelets optimize the storage utilization by attempting to store the same wavelet coefficient across different measures. This reduces the bookkeeping overhead and more coefficients can be stored. An optimal algorithm for minimizing the error in representation and an approximation algorithm for the complementary problem was provided. However, both their algorithms take linear space. Synopsis structures are often used in environments where space is at a premium and the data arrives as a continuous stream which is too expensive to store. In this paper, we give algorithms for extended wavelets which are space sensitive, i.e., use space which is dependent on the size of the synopsis (and at most on the logarithm of the total data) and operates in a streaming fashion. We present better optimal algorithms based on dynamic programming and a near optimal approximate greedy algorithm. We also demonstrate the performance benefits of our algorithms compared to previous ones through experiments on real-life and synthetic datasets.

REHIST: Relative Error Histogram Construction Algorithms
Sudipto Guha (Univ. of Pennsylvania), Kyuseok Shim, Jungchul Woo (Seoul National Univ.)

Histograms and Wavelet synopses provide useful tools in query optimization and approximate query answering. Traditional histogram construction algorithms, such as V-Optimal, optimize absolute error measures for which the error in estimating a true value of 10 by 20 has the same effect of estimating a true value of1000 by 1010. However, several researchers have recently pointed out the drawbacks of such schemes and proposed wavelet based schemes to minimize relative error measures. None of these schemes provide satisfactory guarantees -- and we provide evidence that the difficulty may lie in the choice of wavelets as the representation scheme. In this paper, we consider histogram construction for the known relative error measures. We develop optimal as well as fast approximation algorithms. We provide a comprehensive theoretical analysis and demonstrate the effectiveness of these algorithms in providing significantly more accurate answers through synthetic and real life data sets.

Distributed Set-Expression Cardinality Estimation
Abhinandan Das (Cornell Univ.), Sumit Ganguly (IIT Kanpur), Minos Garofalakis, Rajeev Rastogi (Bell Labs, Lucent Technologies)

We consider the problem of estimating set-expression cardinality in a distributed streaming environment where rapid update streams originating at remote sites are continually transmitted to a central processing system. At the core of our algorithmic solutions for answering set-expression cardinality queries are two novel techniques for lowering data communication costs without sacrificing answer precision. Our first technique exploits global knowledge of the distribution of certain frequently occurring stream elements to significantly reduce the transmission of element state information to the central site. Our second technical contribution involves a novel way of capturing the semantics of the input set expression in a boolean logic formula, and using models (of the formula) to determine whether an element state change at a remote site can affect the set expression result. Results of our experimental study with real-life as well as synthetic data sets indicate that our distributed set-expression cardinality estimation algorithms achieve substantial reductions in message traffic compared to naive approaches that provide the same accuracy guarantees.


INDUSTRIAL SESSION 3: SEMANTIC QUERY APPROACHES

Supporting Ontology-Based Semantic Matching in RDBMS
Souripriya Das, Eugene Chong, George Eadon, Jagannathan Srinivasan (Oracle Corp.)

Ontologies are increasingly being used to build applications that utilize domain-specific knowledge. This paper addresses the problem of supporting ontology-based semantic matching in RDBMS. Specifically, 1) A set of SQL operators, namely ONT_RELATED, ONT_EXPAND, ONT_DISTANCE, and ONT_PATH, are introduced to perform ontology-based semantic matching, 2) A new indexing scheme ONT_INDEXTYPE is introduced to speed up ontology-based semantic matching operations, and 3) System-defined tables are provided for storing ontologies specified in OWL. Our approach enables users to reference ontology data directly from SQL using the semantic match operators, thereby opening up possibilities of combining with other operations such as joins as well as making the ontology-driven applications easy to develop and efficient. In contrast, other approaches use RDBMS only for storage of ontologies and querying of ontology data is typically done via APIs. This paper presents the ontology-related functionality including inferencing, discusses how it is implemented on top of Oracle RDBMS, and illustrates the usage with several database applications.

BioPatentMiner: An Information Retrieval System for BioMedical Patents
Sougata Mukherjea, Bhuvan Bamba (IBM India Research Lab)

Before undertaking new biomedical research, identifying concepts that have already been patented is essential. Traditional keyword based search on patent databases may not be sufficient to retrieve all the relevant information, especially for the biomedical domain. More sophisticated retrieval techniques are required. This paper presents BioPatentMiner, a system that facilitates information retrieval from biomedical patents. It integrates information from the patents with knowledge from biomedical ontologies to create a Semantic Web. Besides keyword search and queries linking the properties specified by one or more RDF triples, the system can discover Semantic Associations between the resources. The system also determines the importance of the resources to rank the results of a search and prevent information overload while determining the Semantic Associations.

Flexible String Matching Against Large Databases in Practice
Nick Koudas, Amit Marathe, Divesh Srivastava (AT&T Labs-Research)

Data Cleaning is an important process that has been at the center of research interest in recent years. Poor data quality is the result of a variety of reasons, including data entry errors and multiple conventions for recording database fields, and has a significant impact on a variety of business issues. Hence, there is a pressing need for technologies that enable flexible (fuzzy) matching of string information in a database. Cosine similarity with tf-idf is a well-established metric for comparing text, and recent proposals have adapted this similarity measure for flexibly matching a query string with values in a single attribute of a relation. In deploying tf-idf based flexible string matching against real AT&T databases, we observed that this technique needed to be enhanced in many ways. First, along the functionality dimension, where there was a need to flexibly match along multiple string-valued attributes, and also take advantage of known semantic equivalences. Second, we identified various performance enhancements to speed up the matching process, potentially trading off a small degree of accuracy for substantial performance gains. In this paper, we report on our techniques and experience in dealing with flexible string matching against real AT&T databases.


RESEARCH SESSION 9:STREAM QUERY PROCESSING

Memory-Limited Execution of Windowed Stream Joins
Utkarsh Srivastava, Jennifer Widom (Stanford Univ.)

We address the problem of computing approximate answers to continuous sliding-window joins over data streams when the available memory may be insufficient to keep the entire join state. One approximation scenario is to provide a maximum subset of the result, with the objective of losing as few result tuples as possible. An alternative scenario is to provide a random sample of the join result, e.g., if the output of the join is being aggregated. We show formally that neither approximation can be addressed effectively for a sliding-window join of arbitrary input streams. Previous work has only addressed the maximum-subset problem, and has implicitly used a frequency-based model of stream arrival. We address the sampling problem for this model. More importantly, we point out a broad class of applications for which an age-based model of stream arrival is more appropriate, and we address both approximation scenarios under this new model. Finally, for the case of multiple joins being executed with an overall memory constraint, we provide an algorithm for memory allocation across the joins that optimizes a combined measure of approximation in all scenarios considered. All of our algorithms are implemented and experimental results demonstrate their effectiveness.

Resource Sharing in Continuous Sliding-Window Aggregates
Arvind Arasu, Jennifer Widom (Stanford Univ.)

We consider the problem of resource sharing when processing large numbers of continuous queries. We specifically address sliding-window aggregates over data streams, an important class of continuous operators for which sharing has not been addressed. We present a suite of sharing techniques that cover a wide range of possible scenarios: different classes of aggregation functions (algebraic, distributive, holistic), different window types (time-based, tuple-based, suffix, historical), and different input models (single stream, multiple substreams). We provide precise theoretical performance guarantees for our techniques, and show their practical effectiveness through a thorough experimental study.

Remembrance of Streams Past: Overload-Sensitive Management of Archived Streams
Sirish Chandrasekaran, Michael J. Franklin (UC Berkeley)

This paper studies Data Stream Management Systems that combine real-time data streams with historical data, and hence access incoming streams and archived data simultaneously. A significant problem for these systems is the I/O cost of fetching historical data which inhibits processing of the live data streams. Our solution is to reduce the I/O cost for accessing the archive by retrieving only a reduced (summarized or sampled) version of the historical data. This paper does not propose new summarization or sampling techniques, but rather a framework in which multiple resolutions of summarization/sampling can be generated efficiently. The query engine can select the appropriate level of summarization to use depending on the resources currently available. The central research problem studied is whether to generate the multiple representations of archived data eagerly upon data-arrival, lazily at query-time, or in a hybrid fashion. Concrete techniques for each approach are presented, which are tied to a specific data reduction technique (random sampling). The tradeoffs among the three approaches are studied both analytically and experimentally.

WIC: A General-Purpose Algorithm for Monitoring Web Information Sources
Sandeep Pandey, Kedar Dhamdhere, Christopher Olston (Carnegie Mellon Univ.)

The Web is becoming a universal information dissemination medium, due to a number of factors including its support for content dynamicity. A growing number of Web information providers post near real-time updates in domains such as auctions, stock markets, bulletin boards, news, weather, roadway conditions, sports scores, etc. External parties often wish to capture this information for a wide variety of purposes ranging from online data mining to automated synthesis of information from multiple sources. There has been a great deal of work on the design of systems that can process streams of data from Web sources, but little attention has been paid to how to produce these data streams, given that Web pages generally require ``pull-based'' access. In this paper we introduce a new general-purpose algorithm for monitoring Web information sources, effectively converting pull-based sources into push-based ones. Our algorithm can be used in conjunction with continuous query systems that assume information is fed into the query engine in a push-based fashion. Ideally, a Web monitoring algorithm for this purpose should achieve two objectives: (1) timeliness and (2) completeness of information captured. However, we demonstrate both analytically and empirically using real-world data that these objectives are fundamentally at odds. When resources available for Web monitoring are limited, and the number of sources to monitor is large, it may be necessary to sacrifice some timeliness to achieve better completeness, or vice versa. To take this fact into account, our algorithm is highly parameterized and targets an application-specified balance between timeliness and completeness. In this paper we formalize the problem of optimizing for a flexible combination of timeliness and completeness, and prove that our parameterized algorithm is a 2-approximation in all cases, and in certain cases is optimal.


RESEARCH SESSION 10:MANAGING WEB INFORMATION SOURCES

Similarity Search for Web Services
Xin Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, Jun Zhang (Univ. of Washington)

Web services are loosely coupled software components, published, located, and invoked across the web. The growing number of web services available within an organization and on the Web raises a new and challenging search problem: locating desired web services. Traditional keyword search is insufficient in this context: the specific types of queries users require are not captured, the very small text fragments in web services are unsuitable for keyword search, and the underlying structure and semantics of the web services are not exploited. We describe the algorithms underlying the Woogle search engine for webservices. Woogle supports similarity search for web services, such as finding similar web-service operations and finding operations that compose with a given one. We describe novel techniques to support these types of searches, and an experimental study on a collection of over 1500 web-service operations that shows the high recall and precision of our algorithms.

AWESOME - A Data Warehouse-based System for Adaptive Website Recommendations
Andreas Thor, Erhard Rahm (Univ. of Leipzig)

Recommendations are crucial for the success of large websites. While there are many ways to determine recommendations, the relative quality of these recommenders depends on many factors and is largely unknown. We propose a new classification of recommenders and comparatively evaluate their relative quality for a sample website. The evaluation is performed with AWESOME (Adaptive WEbSite recOMmEndations), a new data warehouse-based recommendation system capturing and evaluating user feedback on presented recommendations. Moreover, we show how AWESOME performs an automatic and adaptive closed-loop website optimization by dynamically selecting the most promising recommenders based on continuously measured recommendation feedback. We propose and evaluate several alternatives for dynamic recommender selection including a powerful machine learning approach.

Accurate and Efficient Crawling for Relevant Websites
Martin Ester (Simon Fraser Univ.), Hans-Peter Kriegel, Matthias Schubert (Univ. of Munich)

Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant web pages, there are various applications which target whole websites instead of single web pages. For example, companies are represented by websites, not by individual web pages. To answer queries targeted at websites, web directories are an established solution. In this paper, we introduce a novel focused website crawler to employ the paradigm of focused crawling for the search of relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites. The external crawler views the web as a graph of linked websites, selects the websites to be examined next and invokes internal crawlers. Each internal crawler views the web pages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling which were adapted to retrieve websites instead of single web pages.

Instance-based Schema Matching for Web Databases by Domain-specific Query Probing
Jiying Wang (Hong Kong Univ. of Science and Technology), Ji-Rong Wen (Microsoft Research Asia), Fred Lochovsky (Hong Kong Univ. of Science and Technology), Wei-Ying Ma (Microsoft Research Asia)

In a Web database that dynamically provides information in response to user queries, two distinct schemas, interface schema (the schema users can query) and result schema (the schema users can browse), are presented to users. Each partially reflects the actual schema of the Web database. Most previous work only studied the problem of schema matching across query interfaces of Web databases. In this paper, we propose a novel schema model that distinguishes the interface and the result schema of a Web database in a specific domain. In this model, we address two significant Web database schema-matching problems: intra-site and inter-site. The first problem is crucial in automatically extracting data from Web databases, while the second problem plays a significant role in meta-retrieving and integrating data from different Web databases. We also investigate a unified solution to the two problems based on query probing and instance-based schema matching techniques. Using the model, a cross validation technique is also proposed to improve the accuracy of the schema matching. Our experiments on real Web databases demonstrate that the two problems can be solved simultaneously with high precision and recall.


RESEARCH SESSION 11:DISTRIBUTED SEARCH AND QUERY PROCESSING

Computing PageRank in a Distributed Internet Search Engine System
Yuan Wang, David DeWitt (University of Wisconsin - Madison)

Existing Internet search engines use web crawlers to download data from the Web. Page quality is measured on central servers, where user queries are also processed. This paper argues that using crawlers has a list of disadvantages. Most importantly, crawlers do not scale. Even Google, the leading search engine, indexes less than 1% of the entire Web. This paper proposes a distributed search engine framework, in which every web server answers queries over its own data. Results from multiple web servers will be merged to generate a ranked hyperlink list on the submitting server. This paper presents a series of algorithms that compute PageRank in such framework. The preliminary experiments on a real data set demonstrate that the system achieves comparable accuracy on PageRank vectors to Google's well-known PageRank algorithm and, therefore, high quality of query results.

Enhancing P2P File-Sharing with an Internet-Scale Query Processor
Boon Thau Loo (UC Berkeley), Joseph M. Hellerstein (UC Berkeley and Intel Research Berkeley), Ryan Huebsch (UC Berkeley), Scott Shenker (UC Berkeley and International Computer Science Institute), Ion Stoica (UC Berkeley)

In this paper, we address the problem of designing a scalable, accurate query processor for peer-to-peer filesharing and similar distributed keyword search systems. Using a globally-distributed monitoring infrastructure, we perform an extensive study of the Gnutella file sharing network, characterizing its topology, data and query workloads. We observe that Gnutella's query processing approach performs well for popular content, but quite poorly for rare items with few replicas. We then consider an alternate approach based on Distributed Hash Tables (DHTs). We describe our implementation of PIER Search, a DHT-based system, and propose a hybrid system where Gnutella is used to locate popular items, and PIER Search for handling rare items. We develop an analytical model of the two approaches, and use it in concert with our Gnutella traces to study the trade-off between query recall and system overhead of the hybrid system. We evaluate a variety of localized schemes for identifying items that are rare and worth handling via the DHT. Lastly, we show in a live deployment on fifty nodes on two continents that it nicely complements Gnutella in its ability to handle rare items.

Online Balancing of Range-Partitioned Data with Applications to Peer-to-Peer Systems
Prasanna Ganesan, Mayank Bawa, Hector Garcia-Molina (Stanford Univ.)

We consider the problem of horizontally partitioning a dynamic relation across a large number of disks/nodes by the use of range partitioning. Such partitioning is often desirable in large-scale parallel databases, a swell as in peer-to-peer (P2P) systems. As tuples are inserted and deleted, the partitioning may need to be adjusted, and data moved, in order to achieve storage balance across the participant disks/nodes. We propose efficient, asymptotically optimal algorithms that ensure storage balance at all times, even against an adversarial insertion and deletion of tuples. We combine the above algorithms with distributed routing structures to architect a P2P system that supports efficient range queries, while simultaneously guaranteeing storage balance.

Network-Aware Query Processing for Distributed Stream-Based Applications
Yanif Ahmad, Ugur Cetintemel (Brown Univ.)

This paper investigates the benefits of network awareness when processing queries in widely-distributed environments such as the Internet. We present algorithms that leverage knowledge of network characteristics (e.g., topology, bandwidth, etc.) when deciding on the network locations where the query operators are executed. Using a detailed emulation study based on realistic network models, we analyze and experimentally evaluate the proposed approaches for distributed stream processing. Our results quantify the significant benefits of the network-aware approaches and reveal the fundamental trade-off between bandwidth efficiency and result latency that arises in networked query processing.

Data Sharing Through Query Translation in Autonomous Sources
Anastasios Kementsietsidis, Marcelo Arenas (Univ. of Toronto)

Given a point q, a reverse k nearest neighbor (RkNN) query retrieves all the data points that have q as one of their k nearest neighbors. Existing methods for processing such queries have at least one of the following deficiencies: (i) they do not support arbitrary values of k (ii) they cannot deal efficiently with database updates, (iii) they are applicable only to 2D data (but not to higher dimensionality), and (iv) they retrieve only approximate results. Motivated by these shortcomings, we develop algorithms for exact processing of RkNN with arbitrary values of k on dynamic datasets of any dimensionality. Our methods utilize a conventional multi-dimensional index on the dataset and do not require any pre-computation. In addition to their flexibility, we experimentally verify that the proposed algorithms outperform the existing ones even in their restricted focus.


RESEARCH SESSION 12:STREAM DATA MANAGEMENT SYSTEMS

Linear Road: A Stream Data Management Benchmark
Arvind Arasu (Stanford Univ.), Mitch Cherniack, Eduardo Galvez (Brandeis Univ.), David Maier (OHSU/OGI), Anurag Maskey, Esther Ryvkina (Brandeis Univ.), Michael Stonebraker, Richard Tibbetts (MIT)

This paper specifies the Linear Road Benchmark for Stream Data Management Systems (SDMS). Stream Data Management Systems process streaming data by executing continuous and historical queries while producing query results in real-time. This benchmark makes it possible to compare the performance characteristics of SDMS' relative to each other and to alternative(e.g., relational) systems. Linear Road has been endorsed as an SDMS benchmark by both Aurora (out of Brandeis University, Brown University and MIT) and STREAM (out of Stanford University).Linear Road is inspired by the increasing prevalence of ``variable tolling'' on highway systems throughout the world. Variable tolling uses dynamically determined factors such as congestion levels and accident proximity to calculate toll charges. Linear Road specifies a variable tolling system for a fictional urban area including such features as accident detection and alerts, traffic congestion measurements, toll calculations and historical queries. After specifying the benchmark, we describe experimental results involving two implementations: one using a commercially available Relational Database Management System and the other using Aurora. Our results show that a dedicated Stream Data Management System can outperform a Relational Database Management System by at least a factor of 5 on streaming data applications.

Query Languages and Data Models for Database Sequences and Data Streams
Yan-Nei Law (UCLA), Haixun Wang (IBM T. J. Watson Research), Carlo Zaniolo (UCLA)

We study the fundamental limitations of relational algebra (RA)and SQL in supporting sequence and stream queries, and present effective query language and data model enrichments to deal with them. We begin by observing the well-known limitations of SQL in application domains which are important for data streams, such as sequence queries and data mining. Then we present a formal proof that, for continuous queries on data streams, SQL suffers from additional expressive power problems. We begin by focusing on the notion of non blocking (NB) queries that are the only continuous queries that can be supported on data streams. We characterize the notion of non blocking queries by showing that they are equivalent to monotonic queries. Therefore the notion of NB-completeness for RA can be formalized as its ability to express all monotonic queries expressible in RA using only the monotonic operators of RA. We show that RA is not NB-complete, and SQL is not more powerful than RA for monotonic queries. To solve these problems, we propose extensions that allow SQL to support all the monotonic queries expressible by a Turing machine using only monotonic operators. We show that these extensions are(i) user-defined aggregates (UDAs) natively coded in SQL (rather than in an external language), and (ii) a generalization of the union operator to support the merging of multiple streams according to their timestamps. These query language extensions require matching extensions to basic relational data model to support sequences explicitly ordered by timestamps. Along with the formulation of very powerful queries, the proposed extensions entail more efficient expressions for many simple queries. In particular, we show that non blocking queries are simple to characterize according to their syntactic structure.


RESEARCH SESSION 13:AUDITING

Tamper Detection in Audit Logs
Richard Snodgrass, Shilong Yao, Christian Collberg (Univ. of Arizona)

Audit logs are considered good practice for business systems, and are required by federal regulations for secure systems, drug approval data, medical information disclosure, financial records, and electronic voting. Given the central role of audit logs, it is critical that they are correct and inalterable. It is not sufficient to say, "our data is correct, because we store all interactions in a separate audit log." The integrity of the audit log itself must also be guaranteed. This paper proposes mechanisms within a database management system (DBMS), based on cryptographically strong one-way hash functions, that prevent an intruder, including an auditor or an employee or even an unknown bug within the DBMS itself, from silently corrupting the audit log. We propose that the DBMS store additional information in the database to enable a separate audit log validator to examine the database along with this extra information and state conclusively whether the audit log has been compromised. We show with an implementation on a high-performance storage engine that the overhead for auditing is low and that the validator can efficiently and correctly determine if the audit log has been compromised.

Auditing Compliance with a Hippocratic Database
Rakesh Agrawal, Roberto Bayardo, Christos Faloutsos, Jerry Kiernan, Ralf Rantzau, Ramakrishnan Srikant (IBM Almaden)

We introduce an auditing framework for determining whether a database system is adhering to its data disclosure policies. Users formulate audit expressions to specify the (sensitive) data subject to disclosure review. An audit component accepts audit expressions and returns all queries (deemed "suspicious") that accessed the specified data during their execution. The overhead of our approach on query processing is small, involving primarily the logging of each query string along with other minor annotations. Database triggers are used to capture updates in a backlog database. At the time of audit, a static analysis phase selects a subset of logged queries for further analysis. These queries are combined and transformed into an SQL audit query, which when run against the backlog database, identifies the suspicious queries efficiently and precisely. We describe the algorithms and data structures used in a DB2-based implementation of this framework. Experimental results reinforce our design choices and show the practicality of the approach.


RESEARCH SESSION 14:DATA WAREHOUSING

High-Dimensional OLAP: A Minimal Cubing Approach
Xiaolei Li, Jiawei Han, Hector Gonzalez (Univ. of Illinois at Urbana-Champaign)

Data cube has been playing an essential role in fast OLAP (online analytical processing) in many multi-dimensional data warehouses. However, there exist data sets in applications like bioinformatics, statistics, and text processing that are characterized by high dimensionality, e.g., over 100 dimensions, and moderate size, e.g., around 10^6 tuples. No feasible data cube can be constructed with such data sets. In this paper we will address the problem of developing an efficient algorithm to perform OLAP on such data sets.<p>Experience tells us that although data analysis tasks may involve a high dimensional space, most OLAP operations are performed only on a small number of dimensions at a time. Based on this observation, we propose a novel method that computes a thin layer of the data cube together with associated value-list indices. This layer, while being manageable in size, will be capable of supporting flexible and fast OLAP operations in the original high dimensional space. Through experiments we will show that the method has I/O costs that scale nicely with dimensionality and that are comparable to the costs of accessing a fully materialized data cube.

The Polynomial Complexity of Fully Materialized Coalesced Cubes
Yannis Sismanis, Nick Roussopoulos (Univ. of Maryland)

The data cube operator encapsulates all possible groupings of a dataset and has proved to be an invaluable tool in analyzing vast amountsof data. However its apparent exponential complexity has significantly limited its applicability to low dimensional datasets. Recently the idea of the coalesced cube was introduced, and showed that high-dimensional coalesced cubes are orders of magnitudes smaller in size than the original data cubes even when they calculate and store every possible aggregation with 100% precision. In this paper we present an analytical framework for estimating the size of coalesced cubes. By using this framework on uniform coalesced cubes we show that their size and the required computation time scales polynomially with the dimensionality of the data set and, therefore, a full data cube at 100% precision is not inherently cursed by high dimensionality. Additionally, we show that such coalesced cubes scale polynomially (and close to linearly) with the number of tuples on the dataset. We were also able to develop an efficient algorithm for estimating the size of coalesced cubes before actually computing them, based only on metadata about the cubes. Finally, we complement our analytical approach with an extensive experimental evaluation using real and synthetic data sets, and demonstrate that not only uniform but also zipfian and real coalesced cubes scale polynomially.


RESEARCH SESSION 15:LINK ANALYSIS

Relational Link-based Ranking
Floris Geerts (Univ. of Edinburgh), Heikki Mannila, Evimaria Terzi (Univ. of Helsinki)

Link analysis methods show that the interconnections between web pages have lots of valuable information. The link analysis methods are, however, inherently oriented towards analyzing binary relations. We consider the question of generalizing link analysis methods for analyzing relational databases. To this aim, we provide a generalized ranking framework and address its practical implications. More specifically, we associate with each relational database and set of queries a unique weighted directed graph, which we call the database graph. We explore the properties of database graphs. In analogy to link analysis algorithms, which use the Web graph to rank web pages, we use the database graph to rank partial tuples. In this way we can, e.g., extend the PageRank link analysis algorithm to relational databases and give this extension a random querier interpretation. Similarly, we extend the HITS link analysis algorithm to relational databases. We conclude with some preliminary experimental results.

ObjectRank: Authority-Based Keyword Search in Databases
Andrey Balmin (IBM Almaden), Vagelis Hristidis (Florida International Univ.), Yannis Papakonstantinou (UC San Diego)

The ObjectRank system applies authority-based ranking to keyword search in databases modeled as labeled graphs. Conceptually, authority originates at the nodes (objects) containing the keywords and flows to objects according to their semantic connections. Each node is ranked according to its authority with respect to the particular keywords. One can adjust the weight of global importance, the weight of each keyword of the query, the importance of a result actually containing the keywords versus being referenced by nodes containing them, and the volume of authority flow via each type of semantic connection. Novel performance challenges and opportunities are addressed. First, schemas impose constraints on the graph, which are exploited for performance purposes. Second, in order to address the issue of authority ranking with respect to the given keywords (as opposed to Google's global PageRank) we precompute single keyword ObjectRanks and combine them during run time. We conducted user surveys and a set of performance experiments on multiple real and synthetic datasets, to assess the semantic meaningfulness and performance of ObjectRank.

Combating Web Spam with TrustRank
Zoltan Gyongyi, Hector Garcia-Molina (Stanford Univ.), Jan Pedersen (Yahoo!, Inc.)

Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.


RESEARCH SESSION 16:SENSORS, GRID, PUB/SUB SYSTEMS

Model-Driven Data Acquisition in Sensor Networks * BEST PAPER *
Amol Deshpande (UC Berkeley), Carlos Guestrin (Intel Research Berkeley), Samuel Madden (MIT and Intel Research Berkeley), Joseph Hellerstein (UC Berkeley and Intel Research Berkeley), Wei Hong (Intel Research Berkeley)

Declarative queries are proving to be anattractive paradigm for interacting withnetworks of wireless sensors. The metaphor that "the sensor net is a database" is problematic, however, because sensors do not exhaustively represent the data in the real world. In order to map the raw sensor readings onto physical reality, a "model" of that reality is required to complement the readings. In this paper, we enrich interactive sensor querying with statistical modeling techniques. We demonstrate that such models can help provide answers that are both more meaningful, and, by introducing approximations with probabilistic confidences, significantly more efficient to compute in both time and energy. Utilizing the combination of a model and live data acquisition raises the challenging optimization problem of selecting the best sensor readings to acquire, balancing the increase in the confidence of our answer against the communication and data acquisition costs in the network. We describe an exponential time algorithm for finding the optimal solution to this optimization problem, and a polynomial-time heuristic for identifying solutions that perform well in practice. We evaluate our approach on several real-world sensor-network data sets, taking into account the real measured data and communication quality, demonstrating that our model-based approach provides a high-fidelity representation of the real phenomena and leads to significant performance gains versus traditional data acquisition techniques.

GridDB: A Data-Centric Overlay for Scientific Grids
David Liu, Michael J. Franklin (UC Berkeley)

We present GridDB, a data-centric overlay for scientific grid data analysis. In contrast to currently deployed process-centric middleware, GridDB manages data entities rather than processes. GridDB provides a suite of services important to data analysis: a declarative interface, type-checking, interactive query processing, and memorization. We discuss several elements of GridDB:workflow/data model, query language, software architecture and query processing; and a prototype implementation. We validate GridDB by showing its modeling of real-world physics and astronomy analyses, and measurements on our prototype.

Towards an Internet-Scale XML Dissemination Service
Yanlei Diao, Shariq Rizvi, Michael J. Franklin (UC Berkeley)

Publish/subscribe systems have demonstrated the ability to scale to large numbers of users and high data rates when providing content-based data dissemination services on the Internet. However, their services are limited by the data semantics and query expressiveness that they support. On the other hand, the recent work on selective dissemination of XML data has made significant progress in moving from XML filtering to the richer functionality of transformation for result customization, but in general has ignored the challenges of deploying such XML-based services on an Internet-scale. In this paper, we address these challenges in the context of incorporating the rich functionality of XML data dissemination in a highly scalable system. We present the architectural design of ONYX, a system based on an overlay network. We identify the salient technical challenges in supporting XML filtering and transformation in this environment and propose techniques for solving them.


INDUSTRIAL SESSION 4: AUTOMATIC TUNING IN COMMERCIAL DVMSS

DB2 Design Advisor: Integrated Automatic Physical Database Design
Danny Zilio (IBM Toronto Lab), Jun Rao (IBM Almaden), Sam Lightstone (IBM Toronto Lab), Guy Lohman (IBM Almaden), Adam Storm, Christian, Garcia-Arellano (IBM Toronto Lab), Scott Fadden (IBM Portland)

The DB2 Design Advisor in IBM DB2 Universal Database (DB2 UDB) Version 8.2 for Linux, UNIX and Windows is a tool that, for a given workload, automatically recommends physical design features that are any subset of indexes, materialized query tables (also called materialized views), shared-nothing database partitionings, and multidimensional clustering of tables. Our work is the very first industrial-strength tool that covers the design of as many as four different features, a significant advance to existing tools, which support no more than just indexes and materialized views. Building such a tool is challenging, because of not only the large search space introduced by the interactions among features, but also the extensibility needed by the tool to support additional features in the future. We adopt a novel "hybrid" approach in the Design Advisor that allows us to take important interdependencies into account as well as to encapsulate design features as separate components to lower the reengineering cost. The Design Advisor also features a built-in module that automatically reduces the given workload, and therefore provides great scalability for the tool. Our experimental results demonstrate that our tool can quickly provide good physical design recommendations that satisfy users' requirements.

Automatic SQL Tuning in Oracle 10g
Benoit Dageville, Dinesh Das, Karl Dias, Khaled Yagoub, Mohamed Zait, Mohamed Ziauddin (Oracle Corp.)

SQL tuning is a very critical aspect of database performance tuning. It is an inherently complex activity requiring a high level of expertise in several domains: query optimization, to improve the execution plan selected by the query optimizer; access design, to identify missing access structures; and SQL design, to restructure and simplify the text of a badly written SQL statement. Furthermore, SQL tuning is a time consuming task due to the large volume and evolving nature of the SQL workload and its underlying data. In this paper we present the new Automatic SQL Tuning feature of Oracle 10g. This technology is implemented as a core enhancement of the Oracle query optimizer and offers a comprehensive solution to the SQL tuning challenges mentioned above. Automatic SQL Tuning introduces the concept of SQL profiling to transparently improve execution plans. It also generates SQL tuning recommendations by performing cost-based access path and SQL structure "what-if" analyses. This feature is exposed to the user through both graphical and command line interfaces. The Automatic SQL Tuning is an integral part of the Oracle's framework for self-managing databases. The superiority of this new technology is demonstrated by comparing the results of Automatic SQL Tuning to manual tuning using a real customer workload.

Database Tuning Advisor for Microsoft SQL Server 2005
Sanjay Agrawal, Surajit Chaudhuri, Lubor Kollar, Arun Marathe, Vivek Narasayya, Manoj Syamala (Microsoft Corp.)

The Database Tuning Advisor (DTA) that is part of Microsoft SQL Server 2005 is an automated physical database design tool that significantly advances the state-of-the-art in several ways. First, DTA is capable to providing an integrated physical design recommendation for horizontal partitioning, indexes, and materialized views. Second, unlike today's physical design tools that focus solely on performance, DTA also supports the capability for a database administrator (DBA) to specify manageability requirements while optimizing for performance. Third, DTA is able to scale to large databases and workloads using several novel techniques including: (a) workload compression (b) reduced statistics creation and (c) exploiting test server to reduce load on production server. Finally, DTA greatly enhances scriptability and customization through the use of a public XML schema for input and output. This paper provides an overview of DTA's novel functionality, the rationale for its architecture, and demonstrates DTA's quality and scalability on large customer workloads.


RESEARCH SESSION 17:TOP-K RANKING

Efficiency-Quality Tradeoffs for Vector Score Aggregation
Pavan Kumar C Singitham, Mahathi Mahabhashyam (Stanford Univ.), Prabhakar Raghavan (Verity Inc.)

Finding the l-nearest neighbors to a query in a vector space is an important primitive in text and image retrieval. Here we study an extension of this problem with applications to XML and image retrieval: we have multiple vector spaces, and the query places a weight on each space. Match scores from the spaces are weighted by these weights to determine the overall match between each record and the query; this is a case of score aggregation. We study approximation algorithms that use a small fraction of the computation of exhaustive search through all records, while returning nearly the best matches. We focus on the tradeoff between the computation and the quality of the results. We develop two approaches to retrieval from such multiple vector spaces. The first is inspired by resource allocation. The second, inspired by computational geometry, combines the multiple vector spaces together with all possible query weights into a single larger space. While mathematically elegant, this abstraction is intractable for implementation. We therefore devise an approximation of this combined space. Experiments show that all our approaches (to varying extents) enable retrieval quality comparable to exhaustive search, while avoiding its heavy computational cost.

Merging the Results of Approximate Match Operations
Sudipto Guha (Univ. of Pennsylvania), Nick Koudas, Amit Marathe, Divesh Srivastava (AT&T Labs-Research

Data Cleaning is an important process that has been at the center of research interest in recent years. An important end goal of effective data cleaning is to identify the relational tuple or tuples that are "most related" to a given query tuple. Various techniques have been proposed in the literature for efficiently identifying approximate matches to a query string against a single attribute of a relation. In addition to constructing a ranking (i.e., ordering) of these matches, the techniques often associate, with each match, scores that quantify the extent of the match. Since multiple attributes could exist in the query tuple, issuing approximate match operations for each of them separately will effectively create a number of ranked lists of the relation tuples. Merging these lists to identify a final ranking and scoring, and returning the top-K tuples, is a challenging task. In this paper, we adapt the well-known foot rule distance (for merging ranked lists) to effectively deal with scores. We study efficient algorithms to merge rankings, and produce the top-K tuples, in a declarative way. Since techniques for approximately matching a query string against a single attribute in a relation are typically best deployed in a database, we introduce and describe two novel algorithms for this problem and we provide SQL specifications for them. Our experimental case study, using real application data along with a realization of our proposed techniques on a commercial data base system, highlights the benefits of the proposed algorithms and attests to the overall effectiveness and practicality of our approach.

Top-k Query Evaluation with Probabilistic Guarantees
Martin Theobald, Gerhard Weikum, Ralf Schenkel (Max-Planck Institute of Computer Science)

Top-k queries based on ranking elements of multidimensional datasets are a fundamental building block for many kinds of information discovery. The best known general-purpose algorithm for evaluating top-k queries is Fagin's threshold algorithm (TA). Since the user's goal behind top-k queries is to identify one or a few relevant and novel data items, it is intriguing to use approximative variants of TA to reduce run-time costs. This paper introduces a family of approximative top-k algorithms based on probabilistic arguments. When scanning index lists of the underlying multidimensional data space in descending order of local scores, various forms of convolution and derived bounds are employed to predict when it is safe, with high probability, to drop candidate items and to prune the index scans. The precision and the efficiency of the developed methods are experimentally evaluated based on a large Web corpus and a structured data collection.


RESEARCH SESSION 18:DBMS ARCHITECTURE AND PERFORMANCE

STEPS Towards Cache-resident Transaction Processing
Stavros Harizopoulos, Anastassia Ailamaki (Carnegie Mellon Univ.)

Online transaction processing (OLTP) is a multi-billion dollar industry with high-end database servers employing state-of-the-art processors to maximize performance. Unfortunately, recent studies show that CPUs are far from realizing their maximum intended throughput because of delays in the processor caches. When running OLTP, instruction-related delays in the memory subsystem account for 25 to 40% of the total execution time. In contrast to data, instruction misses cannot be overlapped with out-of-order execution, and instruction caches cannot grow as the slower access time directly affects the processor speed. The challenge is to alleviate the instruction-related delays without increasing the cache size. We propose Steps, a technique that minimizes instruction cache misses in OLTP workloads by multiplexing concurrent transactions and exploiting common code paths. One transaction paves the cache with instructions, while close followers enjoy a nearly miss-free execution. Steps yields up to 96.7% reduction in instruction cache misses for each additional concurrent transaction, and at the same time eliminates up to 64% of mispredicted branches by loading a repeating execution pattern into the CPU. This paper (a) describes the design and implementation of Steps, (b) analyzes Steps using microbenchmarks, and (c) shows Steps performance when running TPC-C on top of the Shore storage manager.

Write-Optimized B-Trees
Goetz Graefe (Microsoft)

Large writes are beneficial both on individual disks and on disk arrays, e.g., RAID-5. The presented design enables large writes of internal B-tree nodes and leaves. It supports both in-place updates and large append-only ("log-structured") write operations within the same storage volume, within the same B-tree, and even at the same time. The essence of the design is to make page migration inexpensive, to migrate pages while writing them, and to make such migration optional rather than mandatory as in log-structured file systems. The inexpensive page migration also aids traditional defragmentation as well as consolidation of free space needed for future large writes. These advantages are achieved with a very limited modification to conventional B-trees that also simplifies other B-tree operations, e.g., key range locking and compression. Prior proposals and prototypes implemented transacted B-tree on top of log-structured file systems and added transaction support to log-structured file systems. Instead, the presented design adds techniques and performance characteristics of log-structured file systems to traditional B-trees and their standard transaction support, notably without adding a layer of indirection for locating B-tree nodes on disk. The result retains fine-granularity locking, full transactional ACID guarantees, fast search performance, etc. expected of a modern B-tree implementation, yet adds efficient transacted page relocation and large, high-bandwidth writes.

Cache-Conscious Radix-Decluster Projections
Stefan Manegold, Peter Boncz, Martin Kersten, Niels Nes (CWI)

As CPUs become more powerful with Moore's law and memory latencies stay constant, the impact of the memory access performance bottleneck continues to grow on relational operators like join, which can exhibit random access on a memory region larger than the hardware caches. While cache-conscious variants for various relational algorithms have been described, previous work has mostly ignored (the cost of) projection columns. However, real-life joins almost always come with projections, such that proper projection column manipulation should be an integral part of any generic join algorithm. In this paper, we analyze cache-conscious hash-join algorithms including projections on two storage schemes: N-ary Storage Model (NSM) and Decomposition Storage Model (DSM). It turns out, that the strategy of first executing the join and only afterwards dealing with the projection columns (i.e., post-projection) on DSM, in combination with a new finely tunable algorithm called Radix-Decluster, outperforms all previously reported projection strategies. To make this result generally applicable, we also outline how DSM Radix-Decluster can be integrated in a NSM-based RDBMS using projection indices.

Clotho: Decoupling Memory Page Layout from Storage Organization
Minglong Shao, Jiri Schindler, Steven Schlosser, Anastassia Ailamaki, Gregory Ganger (Carnegie Mellon Univ.)

As database application performance depends on the utilization of the memory hierarchy, smart data placement plays a central role in increasing locality and in improving memory utilization. Existing techniques, however, do not optimize accesses to all levels of the memory hierarchy and for all the different workloads, because each storage level uses different technology (cache, memory, disks) and each application accesses data using different patterns. Clotho is a new buffer pool and storage management architecture that decouples in-memory page layout from data organization on non-volatile storage devices to enable independent data layout design at each level of the storage hierarchy. Clotho can maximize cache and memory utilization by (a) transparently using appropriate data layouts in memory and non-volatile storage, and (b) dynamically synthesizing data pages to follow application access patterns at each level as needed. Clotho creates in-memory pages individually tailored for compound and dynamically changing workloads, and enables efficient use of different storage technologies (e.g., disk arrays or MEMS-based storage devices). This paper describes the Clotho design and prototype implementation and evaluates its performance under a variety of workloads using both disk arrays and simulated MEMS-based storage devices.


INDUSTRIAL SESSION 5: XML IMPLEMENTATIONS AUTOMATIC PHYSICAL DESIGN AND INDEXING

Query Rewrite for XML in Oracle XML DB
Muralidhar Krishnaprasad, Zhen Liu, Anand Manikutty, James W. Warner, Vikas Arora, Susan Kotsovolos (Oracle Corp.)

Oracle XML DB integrates XML storage and querying using the Oracle relational and object relational framework. It has the capability to physically store XML documents by shredding them as relational or object relational data, and creating logical XML documents using SQL/XML publishing functions. However, querying XML in a relational or object relational database poses several challenges. The biggest challenge is to efficiently process queries against XML in a database whose fundamental storage is table-based and whose fundamental query engine is tuple- oriented. In this paper, we present the 'XML Query Rewrite' technique used in Oracle XML DB. This technique integrates querying XML using XPath embedded inside SQL operators and SQL/XML publishing functions with the object relational and relational algebra. A common set of algebraic rules is used to reduce both XML and object queries into their relational equivalent. This enables a large class of XML queries over XML type tables and views to be transformed into their semantically equivalent relational or object relational queries. These queries are then amenable to classical relational optimizations yielding XML query performance comparable to relational. Furthermore, this rewrite technique lays out a foundation to enable rewrite of XQuery over XML.

Indexing XML Data Stored in a Relational Database
Shankar Pal, Istvan Cseri, Gideon Schaller, Oliver Seeliger, Leo Giakoumakis, Vasili Vasili Zolotov (Microsoft Corp.)

As XML usage grows for both data-centric and document-centric applications, introducing native support for XML data in relational databases brings significant benefits. It provides a more mature platform for the XML data model and serves as the basis for interoperability between relational and XML data. Whereas query processing on XML data shredded into one or more relational tables is well understood, it provides limited support for the XML data model. XML data can be persisted as a byte sequence (BLOB) in columns of tables to support the XML model more faithfully. This introduces new challenges for query processing such as the ability to index the XML blob for good query performance. This paper reports novel techniques for indexing XML data in the upcoming version of Microsoft® SQL Server™, and how it ties into the relational framework for query processing.

Automated Statistics Collection in DB2 UDB
Ashraf Aboulnaga, Peter Haas (IBM Almaden), Mokhtar Kandil, Sam Lightstone (IBM Toronto Development Lab), Guy Lohman, Volker Markl (IBM Almaden), Ivan Popivanov (IBM Toronto Development Lab), Vijayshankar Raman (IBM Almaden)

The use of inaccurate or outdated database statistics by the query optimizer in a relational DBMS often results in a poor choice of query execution plans and hence unacceptably long query processing times. Configuration and maintenance of these statistics has traditionally been a time-consuming manual operation, requiring that the database administrator (DBA) continually monitor query performance and data changes in order to determine when to refresh the statistics values and when and how to adjust the set of statistics that the DBMS maintains. In this paper we describe the new Automated Statistics Collection (ASC) component of DB2 Stinger. This autonomic technology frees the DBA from the tedious task of manually supervising the collection and maintenance of database statistics. ASC monitors both the update-delete-insert (UDI) activities on the data as well as query feedback (QF), i.e., the results of the queries that are executed on the data. ASC uses these two sources of information to automatically decide which statistics to collect and when to collect them. This combination of UDI-driven and QF-driven autonomic processes ensures that the system can handle unforeseen queries while also ensuring good performance for frequent and important queries. We present the basic concepts, architecture, and key implementation details of ASC in DB2 Stinger, and present a case study showing how the use of ASC can speed up a query workload by orders of magnitude without requiring any DBA intervention.

High Performance Index Build Algorithms for Intranet Search Engines
Marcus Fontoura, Eugene Shekita, Jason Zien, Sridhar Rajagopalan, Andreas Neumann (IBM Almaden)

There has been a substantial amount of research on high-performance algorithms for constructing an inverted text index. However, constructing the inverted index in a intranet search engine is only the final step in a more complicated index build process. Among other things, this process requires an analysis of all the data being indexed to compute measures like PageRank. The time to perform this global analysis step is significant compared to the time to construct the inverted index, yet it has not received much attention in the research literature. In this paper, we describe how the use of slightly outdated information from global analysis and a fast index construction algorithm based on radix sorting can be combined in a novel way to significantly speed up the index build process without sacrificing search quality.

Automated Design of Multi-Dimensional Clustering Tables in Relational Databases
Sam Lightstone (IBM Toronto Laboratory), Bishwaranjan Bhattacharjee (IBM T.J. Watson Research Center)