Workshop on Enabling Real-Time for Business Intelligence

August 24, 2009 - Lyon, France


Keynote Speech 2

Queries over unstructured data: probabilistic methods to the rescue



Unstructured data like emails, addresses, invoices, call transcripts, reviews, and press releases are now an integral part of any large enterprise. A crucial challenge of modern business intelligence applications is analyzing and querying data seamlessly across structured and unstructured sources. This requires the development of automated techniques for extracting structured records from text sources and resolving entity mentions in data from various sources. The success of any automated method for extraction and integration depends on how effectively it unifies diverse clues in the unstructured source and in existing structured databases. We argue that statistical learning techniques like Conditional Random Fields (CRFs) provide a accurate, elegant and principled framework for tackling these tasks. Given the inherent noise in real-world sources, it is important to capture the uncertainty of the above operations via imprecise data models. CRFs provide a sound probability distribution over extractions but are not easy to represent and query in a relational framework. We present methods of approximating this distribution to query-friendly row and column uncertainty models. Finally, we present models for representing the uncertainty of de-duplication and algorithms for various Top-K count queries on imprecise duplicates.


About the speaker

Sunita Sarawagi researches in the fields of databases, data mining, and machine learning. Her current research interests are information integration, graphical and structured models, and probabilistic databases. She is associate professor at IIT Bombay. Prior to that she was a research staff member at IBM Almaden Research Center. She got her PhD in databases from the University of California at Berkeley and a bachelors degree from IIT Kharagpur. She has several publications in databases and data mining and several patents. She serves on the board of directors of ACM SIGKDD and VLDB foundation. She was program chair for the ACM SIGKDD 2008 conference and has served as program committee member for SIGMOD, VLDB, SIGKDD, ICDE, and ICML conferences. She is on the editorial board of the ACM TODS, ACM TKDD, and FnT for machine learning journals.