UofT Database Group LogoUofT Database Group Logo
UofT Database Group LogoUofT Database Group Logo

2001 Database Seminar Series

October 23, 2001 Wenfei Fan, Bell Laboratories
November 13, 2001 Leopoldo Bertossi, Carleton University
January 11, 2002 Philip Bernstein, Microsoft Research
January 22, 2002 H. V. Jagadish, University of Michigan
February 12, 2002 Christos Papadimitriou, University of California, Berkeley
March 6, 2002 Todd C. Mowry, Carnegie Mellon University
March 21, 2002 Zack Ives, University of Washington
April 11, 2002 Nick Koudas, AT&T Research Lab
May 29, 2002 Frank Neven, University of Limburg, Belgium
May 29, 2002 Alberto H. F. Laender, Federal University of Minas Gerais, Brazil
July 4, 2002 Volker Markl, IBM Almaden Research Center

[Return to DB Group Seminar Page]

Tuesday, October 23, 2001

Title: Keys for XML
Speaker: Wenfei Fan
Temple University
Abstract: Key constraints have proved valuable and effective to database management in semantic specification, query optimization, data integration, and choice of efficient storage and access methods. The importance of key constraints is also being recognized for XML. In this talk, we describe certain requirements on key specifications for XML, examine existing specifications, and present a new proposal for specifying XML keys. We give preliminary complexity results for reasoning about XML keys, and show that XML key analysis is far more intriguing than its relational counterpart. Finally, we look at several alternative specifications for XML keys.

This is joint work with Peter Buneman from UPenn.

Bio: Dr. Wenfei Fan is an assistant professor in the Department of Computer and Information Sciences at Temple University, Philadelphia, PA. Dr. Fan's research interests include Web databases, XML data management, data integration, electronic commerce, digital libraries, database programming languages, logic and computation. His recent research has focused on integrity constraints for XML and on semantic specifications of semistructured data. His work is being widely cited by database researchers. It has also been incorporated into syllabi for graduate level courses at several universities, and has been included in database textbooks. He is a member of ACM/ACM SIGMOD, and an awardee of NSF CAREER Award. He reviews for a number of prestigious journals such as Journal of Computer and System Sciences (JCSS), and serves on the program committees of several international database conferences. Dr. Fan received his B.S. and M.S. in Computer Science from Peking University, and a Ph.D. in Computer and Information Science from the University of Pennsylvania. He is also affiliated with the Internet Research Department, Bell Laboratories.
  • Location: PT378
  • Time: 12:00AM
[Return to Toronto DB Seminar Index]



Tuesday, November 13, 2001

Title: Querying Inconsistent Databases
Speaker: Leopoldo Bertossi
Carleton University
Abstract: Databases may become inconsistent with respect to a given set of integrity constraints. Nevertheless, their data is mostly consistent, even when the database as a whole is inconsistent. In this talk we review several research results in relation to consistent data in inconsistent databases:
(a) Characterization of consistent answers to a query (this definition is based on the class of minimal repairs of the database).
(b) Computational mechanisms to obtain consistent answers.
(c) Consistent answers to aggregate queries with functional dependencies.
(d) Logical specifications of the class of database repairs and their application to consistent query answering.
(e) Consistent answers in data integration.

Bio: Dr. Leopoldo Bertossi is a full professor of computer science at Carleton University, Ottawa. His research centers on database systems, intelligent information systems, data warehousing, knowledge representation in AI, computational logic, logic programming, foundations of probability and statistical methods. Dr. Bertossi received his Ph.D. from PUC, Chile in 1988.
  • Location: PT378
  • Time: 12:00AM
[Return to Toronto DB Seminar Index]



Friday, January 11, 2002

Title: A Model for Peer-to-Peer Database Systems
Speaker: Philip Bernstein
Microsoft Research
Abstract: We motivate with examples the need for a new data model intended to support data management within a peer-to-peer setting. We also outline a formalization of the data model in terms of a model and a proof theory and prove the latter sound and complete with respect to the former. We also sketch a generalization of an earlier result due to Reiter which characterizes a set of peer databases as a multi-context system.

This work is in collaboration with Fausto Giunghilia and Luciano Serafini of IRST (Trento, Italy) and John Mylopoulos.

Bio: Philip Bernstein is a Senior Researcher in the database group of Microsoft Research and a contributor to the Microsoft Repository product group, where he was Architect from 1994-1998. He has published over 90 articles and 3 books on database systems and related topics, and has contributed to many database system products, prototypes, and standards. His latest book is "Principles of Transaction Processing" (Morgan Kaufmann Publishers, 1996). He received his Ph.D. from University of Toronto in 1975.
  • Location: GB202
  • Time: 2:00PM
[Return to Toronto DB Seminar Index]



Tuesday, January 22, 2002

Title: TIMBER: A Native XML Database Management System
Speaker: H. V. Jagadish
University of Michigan
Abstract: The eXtensible Markup Language (XML) has recently become very popular as a representation format for a wide variety of data. Large repositories of XML data have begun to emerge. The effective management of XML in a database thus becomes a pressing issue. A central challenge in this regard is the complex and heterogeneous structure of XML data. In this talk, I will discuss the design of TIMBER, a native XML database management system that we are building at the University of Michigan.

In the TIMBER project we are exploring the issues involved in storing XML in native format. We recognize XML documents to be trees, and build a system to manipulate collections of trees. In doing so, we attempt to avoid the pitfall of "instance-at-a-time" navigational access. Rather, we attempt to bring to bear the core ideas of database technology, such as declarative querying, a bulk algebra, and cost-based query optimization.

Bio: H. V. Jagadish is a Professor of Computer Science and Engineering at the University of Michigan in Ann Arbor. After earning his PhD from Stanford in 1985, he spent over a decade at AT&T Bell Laboratories in Murray Hill, N.J., eventually becoming head of AT&T Labs database research department at the Shannon Laboratory in Florham Park, N.J. He has also served as a Professor at the University of Illinois in Urbana-Champaign.

Professor Jagadish is well-known for his broad-ranging research on databases, and has over 80 major papers and 20 patents. He is currently the founding editor of the ACM SIGMOD Digital Review. Among many professional positions he has held, he has previously been an Associate Editor for the ACM Transactions on Database Systems (1992-1995) and Program Chair of the ACM SIGMOD annual conference (1996).

  • Location: PT378
  • Time: 2:00PM
Roundtable: Meet and talk with the speaker at an informal roundtable discussion.
  • Location: PT378
  • Time: 9:45AM
[Return to Toronto DB Seminar Index]



Tuesday, February 12, 2002

Title: Algorithmic Problems Related to the Internet
Speaker: Christos Papadimitriou
University of California, Berkeley
Abstract: The Internet has surpassed the computer as the most complex computational artifact, and it is becoming an important object of study for Theoretical Computer Science. In fact, it is the first such artifact that was not deliberately designed, and thus it must be approached very much like a mysterious physical phenomenon (or behavior) to be understood. Since the Internet is built, operated and used by independent entities in various and varying degrees of competition, Mathematical Economics and Game Theory are obviously pertinent to this study. Finally, the Internet supports access to information of unprecedented scale and availability. We survey some results, with several collaborators, exemplifying these lines of research.

Bio: Christos Papadimitriou is C. Lester Hogan professor of electrical engineering and computer science and acting chair of the Department of Electrical Engineering and Computer Sciences at the University of California at Berkeley. He has a bachelor's degree from Athens Polytechnic and a doctoral degree from Princeton University. He has taught at Harvard University, the Massachusetts Institute of Technology, Athens Polytechnic, Stanford University, and the University of California at San Diego. Papadimitriou has been at Berkeley since 1995. He has written five books, and more than 200 articles on algorithms and complexity, and their applications to various fields, including databases, optimization, artificial intelligence, the life sciences, and economics.
  • Location: SF 1105
  • Time: 11:00Am
[Return to Toronto DB Seminar Index]



Wednesday, March 6, 2002

Title: Improving Index Performance through Prefetching
Speaker: Todd C. Mowry
Carnegie Mellon University
Abstract: This talk proposes and evaluates "Prefetching B+-Trees" (pB+-Trees), which use prefetching to accelerate two important operations on B+-Tree indices: searches and range scans. To accelerate searches, pB+-Trees use prefetching to effectively create wider nodes than the natural data transfer size: e.g., eight vs. one cache lines or disk pages. These wider nodes reduce the height of the B+-Tree, thereby decreasing the number of expensive misses when going from parent to child without significantly increasing the cost of fetching a given node. Our results show that this technique speeds up search and update times by a factor of 1.2-1.5 for main-memory B+-Trees. In addition, it outperforms and is complementary to "Cache-Sensitive B+-Trees." To accelerate range scans, pB+-Trees provide arrays of pointers to their leaf nodes. These allow the pB+-Tree to prefetch arbitrarily far ahead, even for nonclustered indices, thereby hiding the normally expensive cache misses associated with traversing the leaves within the range. Our results show that this technique yields over a sixfold speedup on range scans of 1000+ keys. Although our experimental evaluation focuses on main memory databases, the techniques that we propose are also applicable to hiding disk latency.

Joint work with Shimin Chen and Phillip Gibbons.

Bio: Todd C. Mowry is an Associate Professor in the School of Computer Science at Carnegie Mellon University. He received an M.S.E.E. and Ph.D. from Stanford University in 1989 and 1994, respectively. From 1994 through 1997, he was an Assistant Professor in the ECE and CS departments at the University of Toronto prior to joining Carnegie Mellon University in July, 1997. The goal of Professor Mowry's research is is to develop new techniques for designing computer systems (both hardware and software) such that they can achieve dramatic performance breakthroughs at low cost without placing any additional burden on the programmer. Specifically, he has been focusing on two areas: (i) automatically tolerating the ever-increasing relative latencies of accessing and communicating data (via DRAM, disks, and networks) which threaten to nullify any other improvements in processing efficiency; and (ii) automatically extracting thread-level parallelism from important classes of applications where this is currently not possible. He currently leads the STAMPede and Profet projects, and is a co-leader of the Cache-Resident Database (CRDB) project. He recently received a Sloan Research Fellowship and the TR100 Award from MIT's Technology Review magazine .
  • Location: PT266
  • Time: 3:00AM
[Return to Toronto DB Seminar Index]



Thursday, March 21, 2002

Title: Efficient Query Processing for Data Integration
Speaker: Zack Ives
University of Washington
Abstract: There are two key challenges posed by data integration query processing. First, very little is known about the data sources, but query processors rely on statistics about the data when choosing a "plan" to use in executing a query. Second, data integration applications need to process XML since it has become the standard format for data interchange. Current techniques for processing XML do not suffice for data integration because they do not produce initial answers quickly enough, particularly for data being streamed across a network.

To address the first problem, I have developed convergent query processing, which establishes a feedback loop between query execution and optimization: the system monitors actual query plan performance and uses this new knowledge to re-estimate the cost of alternative query plans. At any point, execution can be stopped and a more promising plan can be started in mid-stream. Convergent query processing not only addresses the problem of limited knowledge in data integration, but it can also benefit traditional databases. To address the second data integration challenge -- returning initial answers quickly for XML queries -- I have developed an XML query processing architecture that incrementally provides results as data is read across the network. Combined with convergent query processing, this XML architecture provides good performance for both initial and final answers.

Bio: Zack Ives is a graduating Ph.D. student at the University of Washington, and will be joining the University of Pennsylvania in December. His research interests are primarily centered in the area of data management for heterogeneous data. His dissertation is on the topics of data integration and XML.
  • Location: GB202
  • Time: 11:00AM
[Return to Toronto DB Seminar Index]



Thursday, April 11, 2002

Title: Join Operations for Data Cleaning
Speaker: Nick Koudas
AT&T Research Lab
Abstract: The problem of data cleaning, which consists of removing inconsistencies and errors from data sources is well known in many areas of information management. In this talk we will consider join operations suitable for data cleaning tasks. We will first review state of the art techniques for cleaning relational data sources through join operations using suitably defined notions of approximate match as a join predicate. We will then study the problem of cleaning XML data sources through correlations realized as join operations.

A challenging aspect of such operations in the XML domain is the document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently approximate match in structure, in addition to content, has to be folded in the join operation. We quantify approximate match in structure and content using well defined notions of distance. For structure, we propose computationally inexpensive ways to reason about the tree edit distance metric between two trees. We then show how the tree edit distance, and other metrics that quantify distance between trees, can be incorporated in a join framework. We introduce the notion of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set and propose techniques to identify them. This gives rise to a variety of approaches for the problem, which we formulate and analyze. We demonstrate the practical utility of our solutions using large collections of real XML data sets.

Bio: Nick Koudas is a researcher at AT&T Shannon Laboratory and a member of the database research department. He received a BS degree from the Computer Engineering and Informatics Department at the University of Patras, Greece in 1992 a MS degree from the University of Maryland at College Park in 1994, and a PhD degree from the University of Toronto in 1998. His research interests are in the area of data integration, data warehousing, data mining, database techniques for metadata management (XML).
  • Location: GB202
  • Time: 11:00AM
[Return to Toronto DB Seminar Index]



Wednesday, May 29, 2002

Title: XML and Tree-walking (from practice to theory)
Speaker: Frank Neven
University of Limburg, Belgium
Abstract: The advent of XML added a new component to the popular logic and databases connection: tree automata. One instance of such automata are tree-walking ones. Tree-walking is a well-known paradigm from formal language theory studied in the context of attribute grammars and tree transformations. This paradigm materialized in XML research in various ways: as a pattern-language, as an abstract model of XML transformations, in the context of XML streaming, and as an abstraction of XSLT). In this talk, we discuss the expressiveness of the tree-walking paradigm in two scenario's: XML modeled with and without attributes. In particular, we discuss the relationship with first-order and monadic second-order logic. Some proof techniques are based on communication complexity.

Bio: Frank Neven is an assistant professor at the University of Limburg, Belgium. His research interests are in the area of database theory, foundations of XML, formal languages and automata.
  • Location: PT378
  • Time: 2:00PM
[Return to Toronto DB Seminar Index]



Wednesday, May 29, 2002

Title: An Example-based Approach to Extracting Semistructured Web Data
Speaker: Alberto H. F. Laender
Federal University of Minas Gerais, Brazil
Abstract: In this talk we present DEByE - Data Extraction By Example, an approach to extracting data from Web sources that is based examples specified by the user. The novelty of our approach is the fact that the user specifies examples according to a structure of his liking and that this structure is described at example specification time. For the specification of the examples, the user interacts with a tool we developed that adopts nested tables as its visual paradigm. Nested tables are simple, intuitive, and allow shielding the user from technical details (such as HTML tags, formatting operators, and learning automata) related to the extraction problem. The examples provided by the user are then used to generate patterns which allow extracting data from new documents. For the extraction, DEByE adopts a new bottom-up procedure which is very effective with various Web sources, as demonstrated by our experiments.

Bio: Alberto H. F. Laender received a B.S. degree in Electrical Engineering (1974) and an M.Sc. degree in Computer Science (1979), both from the Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil, and a Ph.D. degree in Computing (1984) from the University of East Anglia, Norwich, UK. He joined the UFMG Computer Science Department in 1975, where he is currently a Professor and the head of the Database Research Group.
  • Location: PT378
  • Time: 4:00PM
[Return to Toronto DB Seminar Index]



Thursday, July 4, 2002

Title: LEO - DB2's LEarning Optimizer
Speaker: Volker Markl
IBM Almaden Research Center
Abstract: Most modern DBMS optimizers rely upon a cost model to choose the best query execution plan (QEP) for any given query. Cost estimates are heavily dependent upon the optimizer's estimates for the number of rows that will result at each step of the QEP for complex queries involving many predicates and/or operations. These estimates, in turn, rely upon statistics on the database and modeling assumptions that may or may not be true for a given database. In this talk, we present research on a DB2 prototype that has been carried out at the IBM Almaden Research Center. We introduce LEO, DB2's LEarning Optimizer, as a comprehensive way to repair incorrect statistics and cardinality estimates of a query execution plan. By monitoring previously executed queries, LEO compares the optimizer's estimates with actuals at each step in a QEP, and computes adjustments to cost estimates and statistics that may be used during future query optimizations. This analysis can be done either on-line or off-line on a separate syst em, and either incrementally or in batches. In this way, LEO introduces a feedback loop to query optimization that enhances the available information on the database where the most queries have occurred, allowing the optimizer to actually learn from its past mistakes. Our technique is general and can be applied to any operation in a QEP (not just selection predicates on base tables), including joins, derived results after several predicates have been applied, and even to DISTINCT and GROUP-BY operators. As shown by performance measurements on a 10 GB TPC-H data set, the runtime overhead of LEO's monitoring is insignificant, whereas the potential benefit to response time from more accurate cardinality and cost estimates can be orders of magnitude.

Bio: Volker Markl is a graduate of the Technische Universität München, where he earned a Masters degree in Computer Science in 1995. He completed his PhD in 1999 under the supervision of Rudolf Bayer. His dissertation on "Relational Query Processing Using a Multidimensional Access Technique" was honored "with distinction" by the German Computer Society (Gesellschaft für Informatik). He also earned a degree in Business Administration from the University Hagen, Germany in 1995.

Dr. Markl has been working at IBM's Almaden Research Center in San Jose, USA since January, 2001, conducting research in query optimization, indexing, and self-managing databases. Volker Markl is spearheading the LEO project, an effort on autonomic computing with the goal to create a self-tuning optimizer for DB2 UDB. He also is the Almaden chair for the IBM Data Management Professional Interest Community (PIC).
  • Location: PT378
  • Time: 11:00AM
[Return to Toronto DB Seminar Index]

Last modified: Wed May 29 12:02:56 EST 2002