Benchmarking Declarative Approximate
Selection Predicates
2007
ACM SIGMOD International Conference on Management of Data
This work is part of my master's thesis at the University of
Toronto.
This is joint work with Mohammad Sadoghi, Amit Chandel,
Nick Koudas and Divesh Srivastava.
PDF
ABSTRACT
Declarative data quality has been an active research topic. The
fundamental principle behind a declarative approach to data quality is
the use of declarative statements to realize data quality primitives
on top of any relational data source. A primary advantage of such
an approach is the ease of use and integration with existing
applications.
Over the last couple of years, several similarity predicates have
been proposed for common quality primitives (approximate selections,
joins, etc.) and have been fully expressed using declarative
SQL statements. In this paper we propose new similarity predicates
along with their declarative realization, based on notions of
probabilistic information retrieval. In particular we show how language
models and hidden Markov models can be utilized as similarity
predicates for data quality and present their full declarative
instantiation. We also show how other scoring methods from information
retrieval can be utilized in a similar setting. We then
present full declarative specifications of previously proposed
similarity predicates in the literature, grouping them into classes
according to their primary characteristics. Finally, we present a
thorough performance and accuracy study comparing a large number
of similarity predicates for data cleaning operations. We quantify
both their runtime performance and their accuracy for several
types of common quality problems encountered in operational
databases.
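To make the idea of a declarative approximate selection concrete, the following is a minimal sketch, not the SQL from the paper. It assumes one simple overlap-based measure (Jaccard similarity over padded q-grams) and uses SQLite through Python's standard sqlite3 module. The table names base, base_tokens and query_tokens, the helper functions, and the sample strings are all hypothetical and chosen only for illustration.

```python
import sqlite3


def qgrams(text, q=2):
    """Return the set of q-grams of a lowercased string, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_db(records, q=2):
    """Load records and their q-gram tokens into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE base (tid INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE base_tokens (tid INTEGER, token TEXT)")
    conn.execute("CREATE TABLE query_tokens (token TEXT)")
    for tid, name in enumerate(records):
        conn.execute("INSERT INTO base VALUES (?, ?)", (tid, name))
        conn.executemany(
            "INSERT INTO base_tokens VALUES (?, ?)",
            [(tid, tok) for tok in qgrams(name, q)],
        )
    return conn


# The similarity predicate itself is plain SQL: count the shared tokens per
# tuple, then divide by the size of the union of the two token sets.
JACCARD_SQL = """
WITH base_size AS (
    SELECT tid, COUNT(*) AS n FROM base_tokens GROUP BY tid
),
overlap AS (
    SELECT bt.tid AS tid, COUNT(*) AS common
    FROM base_tokens bt JOIN query_tokens qt ON bt.token = qt.token
    GROUP BY bt.tid
),
scored AS (
    SELECT b.name AS name,
           CAST(o.common AS REAL)
               / (s.n + (SELECT COUNT(*) FROM query_tokens) - o.common) AS score
    FROM overlap o
    JOIN base_size s ON s.tid = o.tid
    JOIN base b ON b.tid = o.tid
)
SELECT name, score FROM scored
WHERE score >= :threshold
ORDER BY score DESC, name
"""


def approx_select(conn, query, threshold=0.5, q=2):
    """Return (name, score) rows whose Jaccard score meets the threshold."""
    conn.execute("DELETE FROM query_tokens")
    conn.executemany(
        "INSERT INTO query_tokens VALUES (?)", [(tok,) for tok in qgrams(query, q)]
    )
    return conn.execute(JACCARD_SQL, {"threshold": threshold}).fetchall()


if __name__ == "__main__":
    db = build_db(["AT&T Corporation", "ATT Corp", "IBM Incorporated"])
    for name, score in approx_select(db, "AT&T Corp", threshold=0.5):
        print(f"{score:.3f}  {name}")
```

Only the tokenization happens outside the database here; the scoring and thresholding are a single SQL statement, which is what lets a predicate of this kind run on top of an existing relational source. Other predicates would replace the scoring expression and, where needed, add weight tables.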
DATASETS
Data generator: s-dbgen, a modified version of dbgen.
Some of the datasets used in the paper: datasets.zip
Oktie Hassanzadeh - oktie at cs.toronto.edu