|Title:||Seamless Integration of Biological Applications into a Database Framework|
Data Logic, a division of Gene Logic Inc.
There are more than two hundred biological data repositories available
for public access, and a vast number of applications to process and
interpret biological data. A major challenge for bioinformaticians is
to extract and process data from multiple data sources using a variety
of query interfaces and analytical tools.
In this talk, I will describe tools that respond to this challenge by
providing support for cross-database queries and for integrating
analytical tools in a query processing environment. In particular, I
will describe two alternative methods for integrating biological data
processing within traditional database queries: (a) "light-weight"
application integration based on Application Specific Data Types
(ASDTs) and (b) "heavy-duty" integration of analytical tools based on
mediators and wrappers. These methods are supported by the
Object-Protocol Model (OPM) suite of tools for managing biological data.
This is joint work with A. Kosky and V. Markowitz.
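The OPM tool suite itself is not detailed in the abstract. Purely as a sketch of the "light-weight" style of integration, a domain-specific analysis function can be registered with a SQL engine and then invoked inside ordinary queries; the function, schema, and data below are hypothetical, using Python's built-in sqlite3:

```python
import sqlite3

# Hypothetical analysis function standing in for a real sequence-processing tool.
def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (name TEXT, seq TEXT)")
conn.executemany("INSERT INTO genes VALUES (?, ?)",
                 [("g1", "ATGCGC"), ("g2", "ATATAT")])

# Register the analysis function so it can be called inside SQL queries.
conn.create_function("gc_content", 1, gc_content)
rows = conn.execute(
    "SELECT name FROM genes WHERE gc_content(seq) > 0.5").fetchall()
print(rows)  # → [('g1',)]
```

The point of the "light-weight" approach is exactly this: the analysis code runs inside the query processor's evaluation loop, so no separate mediator layer is needed.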
|Title:||The Design and Implementation of Microsoft Repository|
|Speaker:|| Philip A. Bernstein,
Microsoft Research |
Philip A. Bernstein is a Senior Researcher in the database group of Microsoft Research and a contributor to the Microsoft Repository product group, where he was Architect from 1994-1998. He has published over 90 articles and 3 books on database systems and related topics, and has contributed to many database system products, prototypes, and standards. His latest book is "Principles of Transaction Processing" (Morgan Kaufmann Publishers, 1996). He received his Ph.D. from the University of Toronto in 1975.
Phil will also be giving a Colloquium on Tuesday, October 19th.
Microsoft Repository is an object-oriented meta-data management
facility that ships in Microsoft SQL Server and Visual Studio. It
includes a repository engine that implements a set of object-oriented
interfaces on top of a SQL database system. A developer can use these
interfaces to define information models (i.e., schemas) and manipulate
instances of the models. It also includes the Open Information Model,
which is a set of information models that cover object modeling,
database modeling, and component reuse. This talk presents an overview
of the interfaces, implementation and applications of Microsoft
Repository. We also focus on two interesting aspects of the latest
release: version and configuration management and prefetching based on
the context of recent navigational operations.
Meet and talk with the speaker at an informal roundtable
|Title:||WHOWEDA - Warehouse of Web Data|
Sanjay Kumar Madria, Purdue University
The growth of the Internet has dramatically changed the way in which
information is managed and accessed. Information on the WWW is
important not only to individual users, but also to business
organizations especially when decision making is concerned. These
This information is placed independently by different organizations, so
documents containing related information may appear at different web
sites. Search engines are well known to have several limitations in
retrieving useful information. To overcome these limitations and
provide the user with a powerful and friendly query mechanism for
accessing information on the web, the critical problem is to find
effective ways to build web data models and query languages, and to
provide effective mechanisms for manipulating the information of
interest in order to garner additional useful knowledge. The
talk deals with the web data model, web algebra, query language and
knowledge discovery in the context of WHOWEDA (Warehouse of Web Data)
project. The key objective is to design and implement a web warehouse
that materializes and manages useful information from the web. In
particular, we discuss building a web warehouse using a database
approach to managing and manipulating a warehouse of strategic
information culled from the web. Depending on time, some
other aspects of our talk may include web change management, web data
mining and discussion on important open issues.
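The WHOWEDA data model and algebra are not specified in this abstract. Purely as an invented miniature of the underlying idea, sets of interlinked documents can be treated as first-class "web tuples" over which selection-style operators run; all names and semantics below are illustrative, not WHOWEDA's:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """A hypothetical web document: URL, text, and outgoing links."""
    url: str
    text: str
    links: list = field(default_factory=list)  # URLs this page points to

def select(tuples, keyword):
    """A web-algebra-style selection: keep web tuples in which some
    document mentions the keyword (semantics invented for illustration)."""
    return [t for t in tuples if any(keyword in d.text for d in t)]

# Two "web tuples": each is a small set of related, interlinked documents.
t1 = [Doc("a.com", "database warehouse"), Doc("b.com", "sports", ["a.com"])]
t2 = [Doc("c.com", "weather")]
print(len(select([t1, t2], "warehouse")))  # → 1
```

The design point this illustrates is that the unit of storage and querying is a set of related documents (with their link structure), not a single page.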
|Title:||Document resemblance and related issues|
Andrei Broder is chief technology officer of the AltaVista Search division in the AltaVista Company. Previously he was a senior member of the research staff at Compaq's Systems Research Center in Palo Alto, California. He graduated from the Technion, Israel Institute of Technology, and did his Ph.D. in Computer Science at Stanford University under Don Knuth. He has authored or co-authored more than 60 scientific papers and holds numerous patents. His main research interests are the design, analysis, and implementation of probabilistic algorithms and supporting data structures, in particular in the context of web-scale applications.
Andrei will also be giving a Colloquium on Tuesday, October 26th.
People often claim that two web pages are "the same" or "roughly the
same", even though classic distances on strings (Hamming, Levenshtein,
etc.) might indicate that the two pages are far apart.
To formalize these intuitive ideas we defined the mathematical concept of document resemblance. The resemblance can be estimated using a fixed-size "sketch" for each document. For a large collection of documents (say 200 million), the size of this sketch is on the order of a few hundred bytes per document. However, for efficient large-scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed; in other words, it suffices to determine whether the resemblance is above a certain threshold. We show how this determination can be made using a "sample" of less than 50 bytes per document.
The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest.
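A concrete sketch of this sampling idea: break each document into word "shingles", hash them, keep only the smallest hash values as the fixed-size sketch, and estimate the relative size of the intersection from the sketches alone. Parameters and the hash choice below are illustrative, not AltaVista's actual implementation:

```python
import hashlib

def shingles(text, w=3):
    """All contiguous w-word "shingles" of a document."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def sketch(text, k=50):
    """The k smallest hash values of the shingle set (a min-wise sample)."""
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles(text))
    return set(hashes[:k])

def estimated_resemblance(a, b, k=50):
    """Estimate |S(a) ∩ S(b)| / |S(a) ∪ S(b)| using only the two sketches."""
    sa, sb = sketch(a, k), sketch(b, k)
    union_sample = sorted(sa | sb)[:k]   # k smallest hashes of the union
    common = [h for h in union_sample if h in sa and h in sb]
    return len(common) / len(union_sample)

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumped over the lazy dog"
print(estimated_resemblance(d1, d2))  # → 0.4
```

For near-duplicate filtering one would simply compare this estimate against a threshold, which is why a small sample per document suffices.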
The ideas for filtering near-duplicate documents discussed here have been successfully implemented and are in current use in the context of the AltaVista search engine.
This talk is tilted towards "algorithm engineering" rather than "algorithm analysis" and very little mathematical background is required.
Meet and talk with the speaker at an informal roundtable
|Title:||Personalization of push technology|
Opher Etzion is an IBM Research staff member. He joined the IBM Research Laboratory in Haifa in 1997, where he leads the activities on reactive systems. In parallel he is an adjunct senior lecturer at the Technion, where he also supervises Ph.D. and M.Sc. theses. Prior to joining IBM, he was a full-time faculty member at the Technion, where he was the founding head of the information systems engineering area and graduate program. Prior to his graduate studies, he held professional and managerial positions in industry and in the Israeli Air Force. His research interests include active technology (active databases and beyond), temporal databases, middleware systems and rule-based systems. He is a member of the editorial board of the IIE Transactions Journal, was a guest editor for the Journal of Intelligent Information Systems in 1994, served as a coordinating organizer of the Dagstuhl seminar on Temporal Databases in 1997, and is the co-editor of the book "Temporal Databases - Research and Practice" published by Springer-Verlag. He has also served on many conference program committees as well as national committees and has been program and general chair of the NGITS workshop series. He is the program chair of CoopIS'2000.
Personalization of knowledge distribution as a result of new events
that are reported by various sources is one of the major challenges in
contemporary Web-based information systems. According to the
Personalization paradigm, each user should get the right information
at the right time in the right form, where the notion of "right" is
user dependent. Current technologies support the publish/subscribe
paradigm, by which a subscriber may subscribe to events that it is
interested in, among those broadcasted by various channels. The major
weakness of this approach is that in many cases the user is not
interested in the basic events, but in a combination of events that
may be quite complex. An example: "if, within a period of two hours from
the start of the trading day, either IBM stock or Microsoft stock is
up by at least 3% more than the change in the Dow Jones index, then
send an immediate alert". The talk describes the "Amit" (Active
Middleware Technology) project of IBM Research Laboratory in Haifa,
and concentrates on the concept of reactive situation, the language to
define such situations, the architecture and the utilization of this
concept for personalization of the push technology for system
management, electronic commerce, awareness systems, and other
applications.
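The Amit language itself is not shown in the abstract. As a toy illustration of detecting such a composite situation over an event stream, consider the following sketch; the event shape, open time, window, and thresholds are assumptions taken from the stock example above:

```python
from datetime import datetime, timedelta

# Assumed event shape: (symbol, percent_change, timestamp).
TRADING_OPEN = datetime(1999, 10, 25, 9, 30)
WINDOW = timedelta(hours=2)

def alert(events):
    """Fire when IBM or MSFT is up at least 3% more than the Dow's change
    within two hours of the open (a toy composite-event detector)."""
    dow = 0.0
    for symbol, change, ts in events:
        if ts - TRADING_OPEN > WINDOW:
            break                       # window closed: situation cannot occur
        if symbol == "DJIA":
            dow = change                # remember the latest index change
        elif symbol in ("IBM", "MSFT") and change - dow >= 3.0:
            return f"alert: {symbol} up {change - dow:.1f}% over the Dow"
    return None

events = [("DJIA", 0.5, TRADING_OPEN + timedelta(minutes=30)),
          ("IBM", 4.0, TRADING_OPEN + timedelta(minutes=45))]
print(alert(events))  # → alert: IBM up 3.5% over the Dow
```

The point is that no single basic event triggers the alert; only a time-constrained combination of events constitutes the reactive situation.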
|Title:||Binary String Relations: A Foundation for Spatiotemporal Knowledge Representation|
Thanasis Hadzilacos, Univ. of Patras
This paper is concerned with the qualitative representation of
spatiotemporal relations. We initially propose a multiresolution
framework for the representation of relations among 1D intervals,
based on a binary string encoding. We subsequently extend this
framework to multiple dimensions, thus allowing the description of
spatiotemporal relations in various contexts. The feasible relations
at a particular resolution level are inherently permeated by a poset
structure, called conceptual neighbourhood, upon which we propose
efficient relation inferencing mechanisms. Finally, we discuss the
application of our model to spatiotemporal reasoning, which refers to
the classic problems of satisfiability and deductive closure of a set
of spatiotemporal assertions.
Keywords: Spatiotemporal relations, spatiotemporal reasoning, conceptual neighbourhoods.
Joint work with Vasilis Delis.
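The paper's exact encoding is not reproduced in the abstract. As one invented illustration of a binary-string encoding of 1D interval relations, the sketch below records, for each region induced by one interval's endpoints, whether the other interval occupies it:

```python
def encode(a, b):
    """Encode the relation of interval b to interval a as a 5-bit string:
    one bit per region induced by a's endpoints (before a, a's start
    point, a's interior, a's end point, after a).  This five-bit scheme
    is illustrative only, not necessarily the paper's exact encoding."""
    a1, a2 = a
    b1, b2 = b
    bits = [
        b1 < a1,                     # b has points before a
        b1 <= a1 <= b2,              # b covers a's start point
        max(a1, b1) < min(a2, b2),   # b overlaps a's interior
        b1 <= a2 <= b2,              # b covers a's end point
        b2 > a2,                     # b has points after a
    ]
    return "".join("1" if x else "0" for x in bits)

print(encode((0, 10), (0, 10)))   # equal intervals → 01110
print(encode((5, 10), (0, 2)))    # b strictly before a → 10000
print(encode((0, 10), (3, 4)))    # b during a → 00100
```

Strings that differ in a single bit correspond to "close" relations, which hints at why such encodings mesh well with the conceptual-neighbourhood structure and the inference mechanisms mentioned above.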
|Title:||Techniques for Online Exploration of Large Object-Relational Datasets|
Peter Haas, IBM Almaden Research Labs |
Peter Haas received an S.B. in Engineering and Applied Mathematics from Harvard University in 1978. In 1979 he received an M.S. in Environmental Engineering from Stanford University. From 1979 to 1981 he was a Staff Scientist at Radian Corporation, where he performed air quality modeling studies for the EPA, Texas Air Control Board, and corporate clients. Three years later, he received an M.S. in Statistics from Stanford University. He received a Ph.D. in Operations Research from Stanford University in 1986 and, after a brief stint as an Assistant Professor in the Department of Decision and Information Sciences at Santa Clara University, he became a Research Staff Member at IBM Almaden Research Center in 1987, where he has been ever since. While working for IBM, he has conducted basic research on modeling and simulation of discrete-event stochastic systems. In the course of this work, he has developed a framework for comparing the modeling power of different formalisms for discrete-event systems, provided techniques for checking applicability of regenerative simulation methods and standardized-time-series methods to specific computer, manufacturing, and telecommunication models, and developed new procedures for measurement and estimation of random delays. Other research projects have included development of sampling-based estimates of join selectivity in relational database management systems, stochastic models of load-balancing in parallel database systems, and techniques for powering down disk drives in PCs. In 1992-93 he spent a sabbatical year at the University of Wisconsin, where he was an Honorary Fellow at the Center for the Mathematical Sciences. In 1999, he was a visiting lecturer in the Department of Engineering Economic Systems and Operations Research at Stanford University, where he taught a graduate-level course in Simulation Methodology.
|Abstract:|| We review
techniques for exploring large object-relational datasets in an
interactive online manner. The idea is to provide continuously updated
running estimates of the final query results to the user, along with
an indication of the precision of the estimates. The user can then
halt the query as soon as the answer is sufficiently precise---the
time required to obtain an acceptable approximate answer can be faster
by orders of magnitude than the time needed to completely process the
query. We describe methods for online processing of aggregation
queries (SUM, AVG, VARIANCE, etc.), online visualization, and online
display of a set of result records. By way of comparison, we also give
a brief review of methods that use precomputed results to rapidly
obtain approximate query answers.
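As a minimal sketch of the online-aggregation idea for an AVG query (scan rows in random order, maintain a running estimate together with a confidence interval, and stop as soon as the answer is precise enough), using a simple normal-approximation interval rather than the estimators of any particular system:

```python
import math
import random

def online_avg(data, tol=1.0, z=1.96, seed=42):
    """Online-aggregation sketch for AVG: scan rows in random order, keep
    running sums, and return early once the (normal-approximation)
    confidence half-width drops below `tol`.  Real systems use more
    careful estimators; this only illustrates the control flow."""
    rng = random.Random(seed)
    rows = list(data)
    rng.shuffle(rows)              # random order makes every prefix a sample
    n = 0
    s = s2 = 0.0
    for x in rows:
        n += 1
        s += x
        s2 += x * x
        if n >= 30:                # wait for the normal approximation to apply
            mean = s / n
            var = max(s2 - n * mean * mean, 0.0) / (n - 1)
            half = z * math.sqrt(var / n)
            if half < tol:         # user-acceptable precision reached
                return mean, half, n
    return s / n, 0.0, n           # scanned everything: exact answer

rng = random.Random(0)
data = [rng.gauss(100, 10) for _ in range(100_000)]
mean, half, n = online_avg(data)
print(n < len(data))  # → True  (answered long before a full scan)
```

This is the source of the orders-of-magnitude speedup mentioned above: the scan halts after a few hundred rows instead of processing the full table.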
|Title:||Que Sera Sera: The Coincidental Confluence of Economics, Business, and Collaborative Computing|
Michael L. Brodie, GTE Labs
Dr. Michael L. Brodie is Sr. Technologist, GTE Technology Organization, Waltham, MA and Chief Scientist (SAP Program) at GTE Corporation. He works on large-scale strategic Information Technology (IT) challenges for GTE Corporation's senior executives. His industrial and research focus is on large-scale information systems - their total life cycle, business and technical contexts, core technologies, and "integration" within a large-scale, operational telecommunications environment. Dr. Brodie has authored over 120 books, chapters, journal articles and conference papers. He has presented keynote talks, invited lectures and short courses on many topics in over twenty-five countries.
|Abstract:|| "The World Wide Web
(WWW) changes everything" but how? Amazon.com and eBay as exemplars of
e-business do not begin to suggest what is possible. Technologists
often think of technology change in technological terms. Technology
serves to realize more significant changes such as the way business is
conducted. More radical and more fundamental changes are those related
to new economic models that underlie, predict, and enable new business
models and which define technology requirements. The potential offered
by the next generation of technology will take at least a decade to
understand and realize, since it involves fundamental change not only
in computing models and practice but also in business and
economics. The current industrial revolution will also lead to
significant social and political change.
This is a time of radical change in what appears to be the parallel worlds of technology, business, and economics. They are not parallel. The intimate relationship of these domains has previously resulted in collateral homeostasis due in part to their interdependence. Current changes in these domains are leading to collateral change. This presentation focuses on the confluence of these changes. As technologists, we see technology and related business change daily. Economic change is less obvious but more radical. Current technology is designed to support 400-year-old economic models, which involve managing within organizational boundaries. It is inadequate to support new economic models, which involve going beyond those boundaries.
This presentation explores the next generation of computing based on the confluence of radical and coincidental changes in economics, business, and technology. Whereas technology is a key enabler of change, it is the servant, not the master. Without a depth of understanding of this enabling role and the context in which technology serves, technology can be misguided and its developers can lose perspective. This presentation outlines a proposal made to the US President's Office of Science and Technology for technology research for the next decade, which calls for new computational models and infrastructure to support the next generation of computing, collaborative computing. A major focus is to reconsider the role of data in computing. This is only one view. Que Sera, Sera
This presentation is in support of THE NETWORKING AND INFORMATION TECHNOLOGY RESEARCH AND DEVELOPMENT ACT, an Act being introduced in the House of Representatives for funding basic research for the fiscal years 2000-2004.
Meet and talk with the speaker at an informal roundtable
|Title:||RDBMS: From Fantasy to Infrastructure|
Dr. Bruce Lindsay is an expert on most aspects of Relational Database engines. He has contributed to many areas of database technology, including concurrency control and recovery technology, distributed and parallel architectures, and Object Relational extensions to SQL. Dr. Lindsay is an IBM Fellow and has participated in the development of Relational Database systems such as SystemR (1st full function RDBMS), R* (distributed RDBMS), Starburst (extensible RDBMS), and DB2 UDB.
Sponsored by the IBM Technical Innovation Speaker Series
|Title:||What Next? A Few Remaining Problems in Information Technology|
Jim Gray, Microsoft Research
Jim Gray is a specialist in database and transaction processing computer systems. At Microsoft his research focuses on scalable computing: building super-servers and workgroup systems from commodity software and hardware. Prior to joining Microsoft, he worked at Digital, Tandem, IBM and AT&T on database and transaction processing systems including Rdb, ACMS, NonStopSQL, Pathway, System R, SQL/DS, and DB2. He is editor of the Performance Handbook for Database and Transaction Processing Systems, and coauthor of Transaction Processing: Concepts and Techniques. He did his PhD dissertation at Berkeley, and is a Member of the National Academy of Engineering, a Fellow of the ACM, a Trustee of the VLDB Foundation, Editor of the Morgan Kaufmann series on Data Management, a member of the National Research Council's Computer Science and Telecommunications Board, and a member of the President's Information Technology Advisory Committee. He received the 1998 ACM A.M. Turing Award.
|Abstract:|| Babbage's vision
of computing has largely been realized. We are on the verge of
realizing Bush's Memex. But, we are some distance from passing the
Turing Test. These three visions and their associated problems have
provided long-range research goals for many of us. For example, the
Scalability problem has motivated me for several decades. This talk
defines a set of fundamental research problems that broaden the
Babbage, Bush, and Turing visions. They extend Babbage's
computational goal to include highly-secure, highly-available,
self-programming, self-managing, and self-replicating systems. They
extend Bush's Memex vision to include a system that automatically
organizes, indexes, digests, evaluates, and summarizes information (as
well as a human might). Another group of problems extends Turing's
vision to include prosthetic vision, speech, hearing, and other
senses. Each problem is simply stated and each is orthogonal to the
others, though they share some common core technologies.
Jim Gray will be giving an additional seminar talk for the Database Group.
|Title:||Architecture-Conscious Database Systems|
Natassa Ailamaki, University of Wisconsin
Modern high-performance processors employ sophisticated techniques
to overlap and simultaneously execute multiple computation and
memory operations. Intuitively, these techniques should help
database applications, which are becoming increasingly compute and
memory bound. Unfortunately, recent research indicates that,
unlike scientific workloads, database systems' performance has not
improved commensurately with increases in processor speeds. As the
gap between memory and processor speed widens, research on
database systems has focused on minimizing memory latencies for
isolated algorithms. However, in order to design high-performance
database systems it is important to carefully evaluate and
understand the interaction between the database software and the
underlying processor and memory architecture.
The first part of this talk introduces a framework for analyzing query execution time on a database system running on a server with a modern processor and memory architecture. Experiments with a variety of benchmarks show that database developers should (a) optimize data placement for the second level of data cache, (b) optimize instruction placement to reduce first-level instruction cache stalls, but (c) not expect the overall execution time to decrease significantly without addressing stalls related to subtle implementation issues (e.g., branch prediction).
The second part of the talk focuses on optimizing data placement for access to the second-level cache. Most commercial DBMSs store records contiguously on disk pages, using the slotted-page approach (NSM). During single attribute scan, NSM exhibits poor spatial locality and has a negative impact on cache performance. The decomposition storage model (DSM) has better spatial locality, but incurs a high record reconstruction cost. We introduce Partition Attributes Across (PAX), a new layout for data records that is applied orthogonally to NSM pages and offers optimized cache utilization with no extra space or time penalty.
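A toy illustration of the difference between the two page layouts; real slotted pages hold raw bytes plus slot arrays and minipage headers, so this sketch keeps only the placement idea:

```python
# Toy records: (id, name, value).
records = [(i, f"name{i}", i * 1.5) for i in range(4)]

# NSM: each record stored contiguously, one after another on the page.
nsm_page = [field for rec in records for field in rec]

# PAX: the same records on the same page, but each attribute grouped
# into its own "minipage".
pax_page = [[rec[j] for rec in records] for j in range(3)]

# A scan of attribute 0 under NSM strides across unrelated fields,
# touching many cache lines ...
nsm_scan = nsm_page[0::3]
# ... while under PAX the same values sit adjacent in one minipage.
pax_scan = pax_page[0]
assert nsm_scan == pax_scan == [0, 1, 2, 3]
```

Because PAX only regroups values within a page, it changes cache behavior without changing which page each record lives on, which is why it composes with NSM rather than replacing it the way DSM does.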
|Title:||Safe Deals Between Strangers|
IBM Almaden Research Center
E-business, information serving, and ubiquitous computing will create
heavy request traffic from strangers or even incognitos. Such requests
must be managed automatically. Two ways of doing this are well known:
giving every incognito consumer the same treatment, and rendering
service in return for money. However, different behavior will often be
wanted, e.g., for a university library with different access policies
for undergraduates, graduate students, faculty, alumni, citizens of the
same state, and everyone else. For a data or process server contacted
by client machines on behalf of users not previously known, we show how
to provide reliable automatic access administration conforming to
service agreements. Implementations scale well from very small
collections of consumers and producers to immense client/server
networks. Servers can deliver information, effect state changes, and
control external equipment.
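As a minimal sketch of such policy-driven automatic administration, here is the university-library example as a policy table; the roles and resources are hypothetical, and a real system would establish the client's role from credentials or service agreements rather than trusting a string:

```python
# Toy policy table for the library example (roles and resources invented).
POLICY = {
    "undergraduate": {"catalog", "reading_room"},
    "graduate":      {"catalog", "reading_room", "stacks"},
    "faculty":       {"catalog", "reading_room", "stacks", "rare_books"},
    "alumni":        {"catalog"},
    "state_citizen": {"catalog"},
}
DEFAULT = {"catalog"}  # everyone else gets the same minimal treatment

def authorized(role, resource):
    """Automatic access decision from the client's (verified) role."""
    return resource in POLICY.get(role, DEFAULT)

print(authorized("graduate", "stacks"))    # → True
print(authorized("alumni", "rare_books"))  # → False
```

The decision is purely table-driven, which is what lets it scale from a handful of consumers to very large client/server networks without per-request human intervention.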
Consumer privacy is easily addressed by the same protocol. We support consumer privacy, but allow servers to deny their resources to incognitos. A protocol variant even protects against statistical attacks by consortia of service organizations. One e-commerce application would put the consumer's tokens on a smart card whose readers are in vending kiosks. In e-business we can simplify supply chain administration. Our method can also be used in sensitive networks without introducing new security loopholes.