Accuracy of Approximate String Joins Using Grams
5th International Workshop on Quality in Databases at VLDB
September 23, 2007, Vienna, Austria
This work was done in the database research group at the University of
Toronto.
It is joint work with Mohammad Sadoghi and Renée J. Miller.
ABSTRACT
Approximate join is an important part of many data cleaning and
integration methodologies. Various similarity measures have been
proposed for accurate and efficient matching of string attributes. The
accuracy of the similarity measures depends heavily on the
characteristics of the data, such as the amount and type of errors
and the length of the strings. Recently, there has been increasing
interest in using methods based on q-grams (substrings of length $q$)
made out of the strings, mainly due to their high efficiency. In this
work, we evaluate the accuracy of the similarity measures used in these
methodologies. We present an overview of several similarity measures
based on q-grams. We then thoroughly compare their accuracy on several
datasets with different characteristics. Since the efficiency of
approximate joins depends on the similarity threshold they use, we
study how the value of the threshold (including values used in recent
performance studies) affects the accuracy of the join. We also compare
different measures based on the highest accuracy they can achieve on
different datasets.
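
To make the ideas in the abstract concrete, here is a minimal sketch of a q-gram based approximate join. It is an illustration only, not the implementation evaluated in the paper. The function names (qgrams, jaccard, approximate_join), the '#' and '$' padding characters, the lowercasing step, and the choice of Jaccard similarity over q-gram sets are all assumptions made for this example.

```python
def qgrams(s, q=2):
    """Return the set of q-grams (substrings of length q) of s.

    Assumption: the string is lowercased and padded with q-1 '#' characters
    at the start and q-1 '$' characters at the end, so that the first and
    last characters contribute as many grams as the inner ones.
    """
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of a and b, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two empty gram sets are treated as identical
    return len(ga & gb) / len(ga | gb)


def approximate_join(left, right, threshold, q=2):
    """Naive nested-loop join: keep pairs whose similarity >= threshold."""
    return [(l, r) for l in left for r in right
            if jaccard(l, r, q) >= threshold]


if __name__ == "__main__":
    left = ["Morgan Stanley"]
    right = ["Morgan Stanly", "Goldman Sachs"]
    # A lower threshold tolerates the one-character error; a higher one
    # rejects the pair, which shows how the threshold affects accuracy.
    for t in (0.7, 0.9):
        print(t, approximate_join(left, right, t))
```

The nested loop is quadratic and is used here only for clarity. The example shows the trade-off the abstract describes: with a threshold of 0.7 the misspelled pair is matched, while a threshold of 0.9 drops it.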
DATASETS
Data Generator: s-dbgen, a modified version of dbgen.
Some of the datasets used in the paper: datasets.zip
-- Please send me an email if you need a copy of the paper --
Oktie Hassanzadeh - oktie at cs.toronto.edu