Benchmarking Declarative Approximate Selection Predicates

2007 ACM SIGMOD International Conference on Management of Data

This work is part of my master's thesis at the University of Toronto.
It is joint work with Mohammad Sadoghi, Amit Chandel, Nick Koudas, and Divesh Srivastava.

.PDF

ABSTRACT
Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications.
Over the last couple of years, several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this paper we propose new similarity predicates along with their declarative realization, based on notions of probabilistic information retrieval. In particular, we show how language models and hidden Markov models can be utilized as similarity predicates for data quality and present their full declarative instantiation. We also show how other scoring methods from information retrieval can be utilized in a similar setting. We then present full declarative specifications of previously proposed similarity predicates in the literature, grouping them into classes according to their primary characteristics. Finally, we present a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations. We quantify both their runtime performance and their accuracy for several types of common quality problems encountered in operational databases.
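
To make the idea of a declarative approximate selection concrete, here is a minimal sketch of one simple overlap-style predicate: strings are split into q-grams, and a plain SQL join with GROUP BY and HAVING returns the tuples that share at least a given number of q-grams with the query. This is an illustration only and is not taken from the paper. The table names (base_tokens, query_tokens), the helper functions, the padding characters, and the use of SQLite as the relational source are all assumptions made for the example.

```python
import sqlite3


def qgrams(s, q=2):
    """Return the set of q-grams of a lowercased string, padded with '#' and '$'."""
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_db(strings, q=2):
    """Tokenize the base strings into a (tid, token) table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE base_tokens (tid INTEGER, token TEXT)")
    conn.execute("CREATE TABLE query_tokens (token TEXT)")
    rows = [(tid, tok) for tid, s in enumerate(strings) for tok in qgrams(s, q)]
    conn.executemany("INSERT INTO base_tokens VALUES (?, ?)", rows)
    return conn


def approximate_select(conn, query, threshold, q=2):
    """Return (tid, overlap) for tuples sharing at least `threshold` q-grams with the query."""
    conn.execute("DELETE FROM query_tokens")
    conn.executemany(
        "INSERT INTO query_tokens VALUES (?)", [(t,) for t in qgrams(query, q)]
    )
    # The similarity predicate itself is expressed purely in SQL.
    sql = """
        SELECT b.tid, COUNT(*) AS overlap
        FROM base_tokens b JOIN query_tokens q ON b.token = q.token
        GROUP BY b.tid
        HAVING COUNT(*) >= ?
        ORDER BY overlap DESC, b.tid
    """
    return conn.execute(sql, (threshold,)).fetchall()


if __name__ == "__main__":
    conn = build_db(["abcd", "abce", "wxyz"])
    for tid, overlap in approximate_select(conn, "abcd", threshold=3):
        print(tid, overlap)
```

Because the scoring is an ordinary SQL query, the same pattern could in principle run on any relational source; weighted predicates would add a weight column to the token tables and replace COUNT(*) with an aggregate over the weights.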

DATASETS
Data Generator: s-dbgen, a modified version of dbgen.
Some of the datasets used in the paper: datasets.zip

Oktie Hassanzadeh - oktie at cs.toronto.edu