


| October 23, 2001 | Wenfei Fan, Bell Laboratories |
| November 13, 2001 | Leopoldo Bertossi, Carleton University |
| January 11, 2002 | Philip Bernstein, Microsoft Research |
| January 22, 2002 | H. V. Jagadish, University of Michigan |
| February 12, 2002 | Christos Papadimitriou, University of California, Berkeley |
| March 6, 2002 | Todd C. Mowry, Carnegie Mellon University |
| March 21, 2002 | Zack Ives, University of Washington |
| April 11, 2002 | Nick Koudas, AT&T Research Lab |
| May 29, 2002 | Frank Neven, University of Limburg, Belgium |
| May 29, 2002 | Alberto H. F. Laender, Federal University of Minas Gerais, Brazil |
| July 4, 2002 | Volker Markl, IBM Almaden Research Center |
| Title: | Keys for XML |
| Speaker: |
Wenfei Fan
Temple University |
| Abstract: |
Key constraints have proved valuable and effective to database management
in semantic specification, query optimization, data integration, and
choice of efficient storage and access methods. The importance of key
constraints is also being recognized for XML. In this talk, we describe
certain requirements on key specifications for XML, examine existing
specifications, and present a new proposal for specifying XML keys. We
give preliminary complexity results for reasoning about XML keys, and show
that XML key analysis is far more intriguing than its relational
counterpart. Finally, we look at several alternative specifications for
XML keys.
This is joint work with Peter Buneman from UPenn.
|
| Bio: |
Dr. Wenfei Fan is an assistant professor in the Department of Computer and
Information Sciences at Temple University, Philadelphia, PA. Dr. Fan's
research interests include Web databases, XML data management, data
integration, electronic commerce, digital libraries, database programming
languages, logic and computation. His recent research has focused on
integrity constraints for XML and on semantic specifications of
semistructured data. His work is being widely cited by database
researchers. It has also been incorporated into syllabi for graduate level
courses at several universities, and has been included in database
textbooks. He is a member of ACM/ACM SIGMOD, and an awardee of NSF CAREER
Award. He reviews for a number of prestigious journals such as Journal of
Computer and System Sciences (JCSS), and serves on the program committees
of several international database conferences. Dr. Fan received his B.S.
and M.S. in Computer Science from Peking University, and a Ph.D. in
Computer and Information Science from the University of Pennsylvania. He
is also affiliated with the Internet Research Department, Bell
Laboratories.
|
| Title: | Querying Inconsistent Databases |
| Speaker: |
Leopoldo Bertossi
Carleton University |
| Abstract: |
Databases may become inconsistent with respect to a given set of
integrity constraints. Nevertheless, their data is mostly consistent,
even when the database as a whole is inconsistent. In this talk we
review several research results in relation to consistent data in
inconsistent databases:
(a) Characterization of consistent answers to a query (this definition is based on the class of minimal repairs of the database). (b) Computational mechanisms to obtain consistent answers. (c) Consistent answers to aggregate queries with functional dependencies. (d) Logical specifications of the class of database repairs and their application to consistent query answering. (e) Consistent answers in data integration. |
| Bio: |
Dr. Leopoldo Bertossi is a full professor of computer science at Carleton
University, Ottawa. His research centers on database systems, intelligent
information systems, data warehousing, knowledge representation in AI,
computational logic, logic programming, foundations of probability and
statistical methods. Dr. Bertossi received his Ph.D. from PUC, Chile in
1988.
|
| Title: | A Model for Peer-to-Peer Database Systems |
| Speaker: |
Philip Bernstein
Microsoft Research |
| Abstract: |
We motivate with examples the need for a new data model intended to
support data management within a peer-to-peer setting. We also outline
a formalization of the data model in terms of a model and a proof
theory and prove the latter sound and complete with respect to the
former. We also sketch a generalization of an earlier result due to
Reiter which characterizes a set of peer databases as a multi-context
system.
This work is in collaboration with Fausto Giunghilia and Luciano
Serafini of IRST (Trento, Italy) and John Mylopoulos.
|
| Bio: |
Philip Bernstein is a Senior Researcher in the database group of Microsoft
Research and a contributor to the Microsoft Repository product group,
where he was Architect from 1994-1998. He has published over 90 articles
and 3 books on database systems and related topics, and has contributed to
many database system products, prototypes, and standards. His latest book
is "Principles of Transaction Processing" (Morgan Kaufmann Publishers,
1996). He received his Ph.D. from University of Toronto in 1975.
|
| Title: | TIMBER: A Native XML Database Management System |
| Speaker: |
H. V. Jagadish
University of Michigan |
| Abstract: |
The eXtensible Markup Language (XML) has recently become very popular as a
representation format for a wide variety of data. Large repositories of
XML data have begun to emerge. The effective management of XML in a
database thus becomes a pressing issue. A central challenge in this regard
is the complex and heterogeneous structure of XML data. In this talk, I
will discuss the design of TIMBER, a native XML database management system
that we are building at the University of Michigan.
In the TIMBER project we are exploring the issues involved in storing XML
in native format. We recognize XML documents to be trees, and build a
system to manipulate collections of trees. In doing so, we attempt to
avoid the pitfall of "instance-at-a-time" navigational access. Rather, we
attempt to bring to bear the core ideas of database technology, such as
declarative querying, a bulk algebra, and cost-based query optimization.
|
| Bio: |
H. V. Jagadish is a Professor of Computer Science and Engineering at the
University of Michigan in Ann Arbor. After earning his PhD from Stanford
in 1985, he spent over a decade at AT&T Bell Laboratories in Murray Hill,
N.J., eventually becoming head of AT&T Labs database research department
at the Shannon Laboratory in Florham Park, N.J. He has also served as a
Professor at the University of Illinois in Urbana-Champaign.
Professor Jagadish is well-known for his broad-ranging research on databases, and has over 80 major papers and 20 patents. He is currently the founding editor of the ACM SIGMOD Digital Review. Among many professional positions he has held, he has previously been an Associate Editor for the ACM Transactions on Database Systems (1992-1995) and Program Chair of the ACM SIGMOD annual conference (1996).
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable
discussion.
|
| Title: | Algorithmic Problems Related to the Internet |
| Speaker: |
Christos Papadimitriou
University of California, Berkeley |
| Abstract: |
The Internet has surpassed the computer as the most complex
computational artifact, and it is becoming an important object of study
for Theoretical Computer Science. In fact, it is the first such
artifact that was not deliberately designed, and thus it must be approached
very much like a mysterious physical phenomenon (or behavior) to be
understood.
Since the Internet is built, operated and used by independent entities
in various and varying degrees of competition, Mathematical Economics and
Game Theory are obviously pertinent to this study. Finally, the Internet
supports access to information of unprecedented scale and availability.
We survey some results, with several collaborators, exemplifying these
lines of research.
|
| Bio: |
Christos Papadimitriou is C. Lester Hogan professor of electrical engineering
and computer science and acting chair of the Department of Electrical
Engineering and Computer Sciences at the University of California at Berkeley.
He has a bachelor's degree from Athens Polytechnic and a doctoral degree from
Princeton University. He has taught at Harvard University, the Massachusetts
Institute of Technology, Athens Polytechnic, Stanford University, and the
University of California at San Diego. Papadimitriou has been at Berkeley
since 1995. He has written five books, and more than 200 articles on
algorithms and complexity, and their applications to various fields, including
databases, optimization, artificial intelligence, the life sciences, and
economics.
|
| Title: | Improving Index Performance through Prefetching |
| Speaker: |
Todd C. Mowry
Carnegie Mellon University |
| Abstract: |
This talk proposes and evaluates "Prefetching B+-Trees" (pB+-Trees), which
use prefetching to accelerate two important operations on B+-Tree indices:
searches and range scans. To accelerate searches, pB+-Trees use
prefetching to effectively create wider nodes than the natural data
transfer size: e.g., eight vs. one cache lines or disk pages. These wider
nodes reduce the height of the B+-Tree, thereby decreasing the number of
expensive misses when going from parent to child without significantly
increasing the cost of fetching a given node. Our results show that this
technique speeds up search and update times by a factor of 1.2-1.5 for
main-memory B+-Trees. In addition, it outperforms and is complementary to
"Cache-Sensitive B+-Trees." To accelerate range scans, pB+-Trees provide
arrays of pointers to their leaf nodes. These allow the pB+-Tree to
prefetch arbitrarily far ahead, even for nonclustered indices, thereby
hiding the normally expensive cache misses associated with traversing the
leaves within the range. Our results show that this technique yields over
a sixfold speedup on range scans of 1000+ keys. Although our experimental
evaluation focuses on main memory databases, the techniques that we propose
are also applicable to hiding disk latency.
Joint work with Shimin Chen and Phillip Gibbons.
|
| Bio: |
Todd C. Mowry is an Associate Professor in the School of Computer Science
at Carnegie Mellon University. He received an M.S.E.E. and Ph.D. from
Stanford University in 1989 and 1994, respectively. From 1994 through
1997, he was an Assistant Professor in the ECE and CS departments at the
University of Toronto prior to joining Carnegie Mellon University in July,
1997. The goal of Professor Mowry's research is is to develop new
techniques for designing computer systems (both hardware and software)
such that they can achieve dramatic performance breakthroughs at low cost
without placing any additional burden on the programmer. Specifically, he
has been focusing on two areas: (i) automatically tolerating the
ever-increasing relative latencies of accessing and communicating data
(via DRAM, disks, and networks) which threaten to nullify any other
improvements in processing efficiency; and (ii) automatically extracting
thread-level parallelism from important classes of applications where this
is currently not possible. He currently leads the STAMPede and Profet
projects, and is a co-leader of the Cache-Resident Database (CRDB)
project. He recently received a Sloan Research Fellowship and the TR100
Award from MIT's Technology Review magazine .
|
| Title: | Efficient Query Processing for Data Integration |
| Speaker: |
Zack Ives
University of Washington |
| Abstract: |
There are two key challenges posed by data integration query processing.
First, very little is known about the data sources, but query processors
rely on statistics about the data when choosing a "plan" to use in
executing a query. Second, data integration applications need to process
XML since it has become the standard format for data interchange. Current
techniques for processing XML do not suffice for data integration because
they do not produce initial answers quickly enough, particularly for data
being streamed across a network.
To address the first problem, I have developed convergent query
processing, which establishes a feedback loop between query execution and
optimization: the system monitors actual query plan performance and uses
this new knowledge to re-estimate the cost of alternative query plans. At
any point, execution can be stopped and a more promising plan can be
started in mid-stream. Convergent query processing not only addresses the
problem of limited knowledge in data integration, but it can also benefit
traditional databases. To address the second data integration challenge
-- returning initial answers quickly for XML queries -- I have developed
an XML query processing architecture that incrementally provides results
as data is read across the network. Combined with convergent query
processing, this XML architecture provides good performance for both
initial and final answers.
|
| Bio: |
Zack Ives is a graduating Ph.D. student at the University of Washington,
and will be joining the University of Pennsylvania in December. His
research interests are primarily centered in the area of data management
for heterogeneous data. His dissertation is on the topics of data
integration and XML.
|
| Title: | Join Operations for Data Cleaning |
| Speaker: |
Nick Koudas
AT&T Research Lab |
| Abstract: |
The problem of data cleaning, which consists of removing inconsistencies
and errors from data sources is well known in many areas of information
management. In this talk we will consider join operations suitable for
data cleaning tasks. We will first review state of the art techniques for
cleaning relational data sources through join operations using suitably
defined notions of approximate match as a join predicate. We will then
study the problem of cleaning XML data sources through correlations
realized as join operations.
A challenging aspect of such operations in the XML domain is the document
structure. Two documents might convey approximately or exactly the same
information but may be quite different in structure. Consequently
approximate match in structure, in addition to content, has to be folded
in the join operation. We quantify approximate match in structure and
content using well defined notions of distance. For structure, we propose
computationally inexpensive ways to reason about the tree edit distance
metric between two trees. We then show how the tree edit distance, and
other metrics that quantify distance between trees, can be incorporated in
a join framework. We introduce the notion of reference sets to facilitate
this operation. Intuitively, a reference set consists of data elements
used to project the data space. We characterize what constitutes a good
choice of a reference set and propose techniques to identify them. This
gives rise to a variety of approaches for the problem, which we formulate
and analyze. We demonstrate the practical utility of our solutions using
large collections of real XML data sets.
|
| Bio: |
Nick Koudas is a researcher at AT&T Shannon Laboratory and a member of the
database research department. He received a BS degree from the Computer
Engineering and Informatics Department at the University of Patras, Greece
in 1992 a MS degree from the University of Maryland at College Park in
1994, and a PhD degree from the University of Toronto in 1998. His
research interests are in the area of data integration, data warehousing,
data mining, database techniques for metadata management (XML).
|
| Title: | XML and Tree-walking (from practice to theory) |
| Speaker: |
Frank Neven
University of Limburg, Belgium |
| Abstract: |
The advent of XML added a new component to the popular logic and databases
connection: tree automata. One instance of such automata are tree-walking
ones. Tree-walking is a well-known paradigm from formal language theory
studied in the context of attribute grammars and tree transformations.
This paradigm materialized in XML research in various ways: as a
pattern-language, as an abstract model of XML transformations, in the
context of XML streaming, and as an abstraction of XSLT). In this talk, we
discuss the expressiveness of the tree-walking paradigm in two scenario's:
XML modeled with and without attributes. In particular, we discuss the
relationship with first-order and monadic second-order logic. Some proof
techniques are based on communication complexity.
|
| Bio: |
Frank Neven is an assistant professor at the University of Limburg,
Belgium. His research interests are in the area of database theory,
foundations of XML, formal languages and automata.
|
| Title: | An Example-based Approach to Extracting Semistructured Web Data |
| Speaker: |
Alberto H. F. Laender
Federal University of Minas Gerais, Brazil |
| Abstract: |
In this talk we present DEByE - Data Extraction By Example, an approach
to extracting data from Web sources that is based examples specified
by the user. The novelty of our approach is the fact that the user
specifies examples according to a structure of his liking and that this
structure is described at example specification time. For the
specification of the examples, the user interacts with a tool we
developed that adopts nested tables as its visual paradigm. Nested
tables are simple, intuitive, and allow shielding the user from
technical details (such as HTML tags, formatting operators, and
learning automata) related to the extraction problem. The examples
provided by the user are then used to generate patterns which allow
extracting data from new documents. For the extraction, DEByE adopts
a new bottom-up procedure which is very effective with various Web
sources, as demonstrated by our experiments.
|
| Bio: |
Alberto H. F. Laender received a B.S. degree in Electrical Engineering
(1974) and an M.Sc. degree in Computer Science (1979), both from the
Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil,
and a Ph.D. degree in Computing (1984) from the University of East
Anglia, Norwich, UK. He joined the UFMG Computer Science Department
in 1975, where he is currently a Professor and the head of the Database
Research Group.
|
| Title: | LEO - DB2's LEarning Optimizer |
| Speaker: |
Volker Markl
IBM Almaden Research Center |
| Abstract: |
Most modern DBMS optimizers rely upon a cost model to choose the best query
execution plan (QEP) for any given query. Cost estimates are heavily
dependent upon the optimizer's estimates for the number of rows that will
result at each step of the QEP for complex queries involving many
predicates and/or operations. These estimates, in turn, rely upon
statistics on the database and modeling assumptions that may or may not be
true for a given database. In this talk, we present research on a DB2
prototype that has been carried out at the IBM Almaden Research Center. We
introduce LEO, DB2's LEarning Optimizer, as a comprehensive way to repair
incorrect statistics and cardinality estimates of a query execution plan.
By monitoring previously executed queries, LEO compares the optimizer's
estimates with actuals at each step in a QEP, and computes adjustments to
cost estimates and statistics that may be used during future query
optimizations. This analysis can be done either on-line or off-line on a
separate syst em, and either incrementally or in batches. In this way, LEO
introduces a feedback loop to query optimization that enhances the available
information on the database where the most queries have occurred, allowing
the optimizer to actually learn from its past mistakes. Our technique is
general and can be applied to any operation in a QEP (not just selection
predicates on base tables), including joins, derived results after several
predicates have been applied, and even to DISTINCT and GROUP-BY operators.
As shown by performance measurements on a 10 GB TPC-H data set, the runtime
overhead of LEO's monitoring is insignificant, whereas the potential
benefit to response time from more accurate cardinality and cost estimates
can be orders of magnitude.
|
| Bio: |
Volker Markl is a graduate of the Technische Universität München, where he
earned a Masters degree in Computer Science in 1995. He completed his PhD
in 1999 under the supervision of Rudolf Bayer. His dissertation on
"Relational Query Processing Using a Multidimensional Access Technique"
was honored "with distinction" by the German Computer Society
(Gesellschaft für Informatik). He also earned a degree in Business
Administration from the University Hagen, Germany in 1995.
Dr. Markl has been working at IBM's Almaden Research Center in San Jose, USA since January, 2001, conducting research in query optimization, indexing, and self-managing databases. Volker Markl is spearheading the LEO project, an effort on autonomic computing with the goal to create a self-tuning optimizer for DB2 UDB. He also is the Almaden chair for the IBM Data Management Professional Interest Community (PIC).
|
Last modified: Wed May 29 12:02:56 EST 2002