UofT Database Group

1998-1999 Database Seminar Series

October 15, 1998

Title: Intersection Stacking for Multi-dimensional Aggregation in RDBMSs
Speaker: Roberta Cochrane, IBM Almaden Research Center, San Jose, CA
Roberta Cochrane is a Research Staff Member at the IBM Almaden Research Center in San Jose, California. She received her Ph.D. in Computer Science from the University of Maryland in 1992. Currently, she is involved in research for technologies to support Business Intelligence in relational database management systems. She has also done extensive research in active database systems. She implemented the Starburst Rule System and led the design and implementation of triggers and constraints for both serial and parallel versions of IBM's DB2 Universal Database System. She played a major role in defining the SQL3 standard for triggers and constraints.
Abstract: Business Intelligence applications perform complex aggregation for large amounts (typically 1 to 10 Terabytes) of data. This puts increasing demands on database systems to provide native support for such processing, often referred to as Online Analytical Processing (OLAP). SQL has recently extended the group-by clause to provide primitives for common OLAP computations in the DBMS, allowing the DBMS more flexibility in processing and optimizing such aggregation. These computations are the data-cube, roll-up, concatenations of roll-up (multi-dimensional cube), and combinations of ad-hoc grouping elements. The specification of the group-by clause can expand into many grouping sets. For example, the cube alone will result in 2**n grouping sets where n is the number of grouping elements. In this talk I will present the SQL OLAP extensions and describe a novel technique for stacking grouping operations. Our technique results in linear expansion of grouping sets, greatly reducing the amount of complexity and resources required to optimize and compute such queries.
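The growth in grouping sets mentioned in the abstract can be illustrated with a small sketch. It only counts the grouping sets that a cube and a roll-up expand into; it is not the stacking technique presented in the talk, and the column names are hypothetical.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """All subsets of the grouping columns: 2**n grouping sets."""
    return [subset
            for size in range(len(columns), -1, -1)
            for subset in combinations(columns, size)]

def rollup_grouping_sets(columns):
    """Prefixes of the column list, longest first: n + 1 grouping sets."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]

columns = ["year", "region", "product"]  # hypothetical grouping columns
cube = cube_grouping_sets(columns)
rollup = rollup_grouping_sets(columns)
print(len(cube), len(rollup))
```

With three grouping columns, the cube expands into eight grouping sets while the roll-up expands into four, which is why cube specifications become expensive quickly as columns are added.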
  • Location: Lash Miller 158 (80 St. George Street)
  • Time: 10:00am
[Return to Toronto DB Seminar Index]

October 15, 1998

Title: Web Search Technology
Speaker: Daniel Ford, IBM Almaden Research Center, San Jose, CA
Daniel Ford is the Manager of the Web Technologies Department at the IBM Almaden Research Center in San Jose, California. He received his Ph.D. in Computer Science from the University of Waterloo in 1990. Since 1992 he has been a Research Staff Member at IBM Almaden doing research into Tertiary RAID (RAIL) storage systems, and more recently in Web Search technologies.
Abstract: Grand Central Station is a system that extends search to all digital sources of data. The system consists of components to access and understand data sources and generate searchable metadata ("Gatherers"), a metadata repository for satisfying ad hoc queries, and an extensible profiling system for processing persistent queries. The Gatherer is an extensible crawler framework written in Java that is capable of using a variety of protocols (e.g., http, ftp, nntp, odbc, cics, pop3) to access and understand a wide range of data formats (HTML, Java Bytecode, PowerPoint, TAR/Zip archives, and many others). The Gatherer generates summaries of each data source it encounters in an instance of XML we call SumML (Summary Metalanguage). Key features of the Gatherer are its ability to be easily extended by adding protocol and data source specific code, and its ability to run, unchanged, on any platform that supports Java. The metadata repository is less advanced, but the Profiling framework is moving forward to encompass multimedia profiling. The system has been deployed in the form of a Java specific search engine called "jCentral" and is accessible from IBM's Java home page (http://www.ibm.com/java). This talk will present and demonstrate the system, and discuss future directions into searching video, image and audio.
  • Location: Lash Miller 158 (80 St. George Street)
  • Time: 11:00am

October 29, 1998

Title: Research Projects in Data Warehousing, Data Mining and Heterogeneous DBMS
Speaker: Renée J. Miller, University of Toronto
Abstract: A common requirement among many of today's data-intensive applications is the need to efficiently manage and analyze large volumes of semi-structured, heterogeneous data. It is this task that lies at the heart of my research agenda. Of particular interest are the tasks required to support data warehousing. Data warehouses provide integrated access to historical data collected from legacy data sources for use in decision support and data analysis. Data warehousing uses techniques developed in a variety of areas including: heterogeneous DBMS (the management and integration of heterogeneous schemas and data); database publishing (the browsing and querying of complex, structured or semi-structured databases by users who are unfamiliar with the data and its organization); and data mining (the search for patterns in data that are useful in data analysis and data reduction). I will give an overview of my past and ongoing work in some of these areas.
  • Location: Wallberg Rm 219
  • Time: 10:00am

November 24, 1998

Title: Hierarchical Data Management
Speaker: H.V. Jagadish, AT&T Labs
Abstract: Much of the data we deal with every day is organized hierarchically: file systems, library classification schemes and yellow page categories are salient examples. Yet, commercial databases use a flat relational model that does not conveniently accommodate hierarchies. The question we seek to address is whether it is possible to combine the benefits of a hierarchical representation of data (better support for heterogeneity, autonomy, scaling) with the strong points of the relational data model (clean data model, declarative query language).

We do so with a particular focus on directory servers, especially LDAP (Lightweight Directory Access Protocol). We develop a hierarchical data model based on LDAP, and explore some possibilities for a query language.

  • Location: Galbraith Rm 220
  • Time: 10:00am
Roundtable: Meet and talk with the speaker at an informal roundtable discussion.
  • Location: DL 378
  • Time: 11:00am

December 1, 1998

Title: Exploratory Association Rule Mining with Constraints
Speaker: Raymond Ng, Univ. of British Columbia
Abstract: From the standpoint of supporting human-centered discovery of knowledge, the present-day model of mining association rules suffers from the following serious shortcomings: (i) lack of user exploration and control, (ii) lack of focus, and (iii) rigid notion of relationships. In effect, this model functions as a black-box, admitting little user interaction in between. We propose, in this talk, an architecture that opens up the black-box, and supports constraint-based, human-centered exploratory mining of associations. The foundation of this architecture is a rich set of constraint constructs, including domain, class, and SQL-style aggregate constraints, which enable users to clearly specify what associations are to be mined. We propose constrained association queries as a means of specifying the constraints to be satisfied by the antecedent and consequent of a mined association.

In this talk, we mainly focus on the technical challenges in guaranteeing a level of performance that is commensurate with the selectivities of the constraints in an association query. To this end, we introduce and analyze two properties of constraints that are critical to pruning: anti-monotonicity and succinctness. We then develop characterizations of various constraints into four categories, according to these properties. Finally, we describe a mining algorithm called CAP, which achieves a maximized degree of pruning for all categories of constraints. Experimental results indicate that CAP can run much faster, in some cases as much as 80 times, than several basic algorithms. This demonstrates how important the succinctness and anti-monotonicity properties are, in delivering the performance guarantee.
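The pruning power of anti-monotonicity can be seen in a minimal sketch. This is not the CAP algorithm and it omits support counting entirely; it only assumes a hypothetical constraint, "total price at most a budget" with non-negative prices, which is anti-monotone because any superset of a violating itemset also violates it.

```python
def constrained_itemsets(prices, budget):
    """Level-wise enumeration of itemsets whose total price is <= budget.

    With non-negative prices, "sum(price) <= budget" is anti-monotone:
    once an itemset violates it, every superset violates it too, so the
    violating itemset is never extended.
    """
    items = sorted(prices)
    level = [(item,) for item in items if prices[item] <= budget]
    result = []
    while level:
        result.extend(level)
        next_level = []
        for itemset in level:
            # Extend only with items that sort after the last one, so each
            # itemset is generated exactly once.
            for item in items:
                if item > itemset[-1]:
                    candidate = itemset + (item,)
                    if sum(prices[i] for i in candidate) <= budget:
                        next_level.append(candidate)
        level = next_level
    return result

prices = {"a": 2, "b": 3, "c": 6}  # hypothetical item prices
valid = constrained_itemsets(prices, budget=5)
print(valid)
```

Item "c" fails the constraint on its own, so no itemset containing it is ever generated or checked at a later level.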

To conclude, we discuss the implications of this work on the issue of how to integrate association rule mining with DBMS.

  • Location: Sanford Fleming 1105
  • Time: 11:00am
Roundtable: Meet and talk with the speaker at an informal roundtable lunch discussion. Pizza will be served.
  • Location: DL 378
  • Time: Noon

December 3, 1998

Title: DISIMA Project: A Distributed, Interoperable Image Database System
Speaker: M. Tamer Özsu, Univ. of Alberta
Abstract: This talk describes the DISIMA project currently under investigation at the Laboratory for Database Systems Research of the University of Alberta. The DISIMA project aims at building an image database system enabling content-based querying. The main emphasis of the work is content-based indexing and querying of images, in particular with respect to spatial relationships. The identifying characteristics of our project are (a) an object-oriented approach to image data management, (b) use of image processing and indexing techniques for efficient querying and access to image databases, and (c) interoperability among various image storage systems. The DISIMA model allows the user to assign different semantics to an image component (semantic independence), and an image representation can be changed without any effect on applications using it (representation independence). The architecture involves different image sources including WWW servers and file systems. The system provides an OQL-based multimedia query language for accessing images (and video). DISIMA is currently being implemented on top of ObjectStore. The talk will focus on system architecture, image modeling, type system design, query language and indexing issues.
  • Location: Galbraith Rm 248
  • Time: 11:00am
Roundtable: Meet and talk with the speaker at an informal roundtable discussion.
  • Location: DL 378
  • Time: 10am, Friday December 4th.

December 8, 1998

Title: Distinguished Lecture Series: Array Databases
Speaker: Ken Salem, University of Waterloo
Ken Salem is an Associate Professor at the University of Waterloo and director of the Distributed Batch Controller Project.
Abstract: Arrays are an appropriate data model for images, gridded output from computational models, and other types of data. This talk will describe an array database system based on a multidimensional data model and a simple query algebra called AML. To write expressions in AML, it is first necessary to supply a set of domain-specific functions. AML can then be used to apply those functions to arrays in a structured way. Examples will be used to illustrate some useful things that can be written in AML, and to show how AML expressions can be treated declaratively and optimized by a database system.
  • Location: Sanford Fleming 1105
  • Time: 11:00am

January 14, 1999

Title: Content-Based Organization of the Information Space in Multi-Database Networks
Speaker: Mike Papazoglou, Tilburg Univ.
Professor Papazoglou is the director of INFOLAB at Tilburg University in the Netherlands. His interests include information systems, data modelling, distributed information services, and agent-based systems. He has recently edited a book on Cooperative Information Systems and another one on Object-Oriented Data Modelling.
Abstract: Rapid growth in the volume of network-available data, complexity, diversity and terminological fluctuations, at different data sources, render network-accessible information increasingly difficult to achieve. The situation is particularly cumbersome for users of multi-database systems who are expected to have prior detailed knowledge of the definition and uses of the information content in these systems.

This talk will describe a conceptual organization of the information space across collections of component systems in multi-databases that provides serendipity, exploration and contextualization support. In this way users can achieve logical connections between concepts they are familiar with and schema terms employed in multi-database systems. Large-scale searching for multi-database schema information is guided by a combination of lexical, structural and semantic aspects of schema terms in order to reveal more meaning both about the contents of a requested information term and about its placement within the distributed information space.

  • Location: Galbraith 248
  • Time: 10:00am
Roundtable: Meet and talk with the speaker at an informal roundtable discussion.
  • Location: SF 3207 (note room change)
  • Time: 2pm, TUESDAY, January 12th

January 19, 1999 Tuesday

Title: DATABASE SYSTEMS: Trends in Database System Development
Speaker: Vidojko Ciric, University of Belgrade
Dr. Vidojko Ciric is a professor of computer science at the University of Belgrade. He received a Ph.D. degree in 1969 from Rice University, EE Department, Houston, Texas, USA. He was granted a research scholarship from NSF & NASA and had an active part in the Apollo 8 flight simulation project at the University. His research interests include software engineering, OO programming, database systems and CASE tool design. His publication list includes more than 160 refereed journal or conference papers and 20 books in system and computer science. According to a survey related to the Gutenberg anniversary, the book "General Sensitivity Theory" by Tomovic and Vukobratovic, published by American Elsevier in 1973, for which Prof. Ciric wrote the chapter "A New Controllability Concept in Sensitivity Design of Optimal Control Systems", belongs to the group of 200 most referenced technical books in engineering. He was a referee for IFAC Automatica and IEEE Computer Society. Currently, he is an active referee and a member of the program committees of two IASTED conferences: Software Engineering and Applied Informatics. He organized and chaired the special session "Issues in OO Design and Programming" at the IASTED SE'98 Conference in Las Vegas.
Details:
  • Location: DL 378
  • Time: 2pm

January 21, 1999

Title: Index Structures for Path Expressions
Speaker: Dan Suciu, AT&T Research
Dan Suciu is a member of the technical staff at AT&T Labs. He is working on semistructured data, and has been involved in several projects on semistructured data (UnQL, Strudel, XML-QL). He received his PhD from the University of Pennsylvania in 1995, and his BS from the Polytechnic of Bucharest (Romania).
Abstract: Queries over semistructured databases contain regular path expressions. The cost of computing them naively is prohibitively high, since in most cases they require the traversal of the entire database. We propose a novel and general index structure for computing regular path expressions on semistructured data, called T-index, whose main features are: (1) it can trade off space for generality, (2) it can always be computed efficiently (in PTIME over the database), and (3) it is provably space efficient. T-indexes generalize several known index structures, such as: data-guides (for semistructured data), Access Support Relations (for OODBs), and Pat trees (for full text indexes).

Joint work with Tova Milo (Tel Aviv University)

  • Location: Galbraith 248
  • Time: 10:00am
Roundtable: Meet and talk with the speaker at an informal roundtable discussion.
  • Location: DL 378
  • Time: 2:00pm, Thursday, January 21

February 4, 1999 Thursday

Title: A Deterministic Model for Semi-Structured Data
Speaker: Peter Buneman, University of Pennsylvania
Abstract: This is a preliminary report on a new model for semi-structured data. The idea of semi-structured data evolved, in part, from various syntactic representations of data such as AceDB, OEM and various data formats and it has more recently been used effectively to design query languages for XML. Insofar as there is an agreed model, it is simply an edge-labeled graph. However, this description begs a number of important questions: What can constitute an edge label? Are there values associated with the vertices? Is there a separate labeling system for vertices to provide them with independent identity? We describe here a new model for semi-structured data. It is more restrictive than the recently described models in that it is deterministic: the edges emanating from any node in the graph have distinct labels. It is less restrictive in that the edges can carry data and may have structure. In fact they may themselves be small pieces of semi-structured data. The advantage of this approach is that each component of the database is uniquely identified by a path. Paths serve as object identifiers or l-values; but unlike object identifiers, paths also have structure, and a number of useful database operations may be obtained by manipulation of this structure. The motivation for this model came in part from the need to develop an annotation system for "curated" databases. While the databases typically have a well-defined and rich structure, annotations are arbitrary and unpredictable, and they require some form of semi-structured approach. Work with Alin Deutsch and Wang-Chiew Tan of the University of Pennsylvania.
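The claim that determinism makes every component addressable by a unique path can be sketched with nested dictionaries, where dictionary keys play the role of the distinct edge labels leaving a node. This is a simplification under stated assumptions: labels here are plain strings, whereas the model in the talk also allows structured labels, and the sample data and function names are hypothetical.

```python
def resolve(tree, path):
    """Follow a sequence of edge labels from the root.

    Because the labels leaving any node are distinct (dict keys), a path
    identifies at most one component.
    """
    node = tree
    for label in path:
        node = node[label]  # raises KeyError if the path does not exist
    return node

def all_paths(tree, prefix=()):
    """Yield the path of every component below the root."""
    if isinstance(tree, dict):
        for label, child in tree.items():
            yield prefix + (label,)
            yield from all_paths(child, prefix + (label,))

# Hypothetical curated record with an annotation attached under "note".
db = {"gene": {"name": "BRCA1", "note": {"author": "curator-1"}}}
paths = list(all_paths(db))
print(paths)
print(resolve(db, ("gene", "note", "author")))
```

Since no two components share a path, a path can act as an identifier, and prefixes of a path identify the enclosing components.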
  • Location: Galbraith 248
  • Time: 10:00am
Roundtable: Meet and talk with the speaker at an informal roundtable discussion.
  • Location: DL Pratt 378
  • Time: 11:00am (immediately following the talk)

February 23, 1999 Tuesday

Title: Tableau Techniques for Querying Information Sources Through Global Schemas
Speaker: Gösta Grahne, Concordia University
Abstract: The foundational homomorphism techniques introduced by Chandra and Merlin for testing containment of conjunctive queries have recently attracted renewed interest due to their central role in information integration applications. We show that generalizations of the classical tableau representation of conjunctive queries are useful for computing query answers in information integration systems where information sources are modeled as views defined on a virtual global schema. We consider a general situation where sources may or may not be known to be correct and complete. We characterize the set of answers to a global query and give algorithms to compute a finite representation of this possibly infinite set, as well as its certain and possible approximations. We show how to rewrite a global query in terms of the sources in two special cases, and show that one of these is equivalent to the Information Manifold rewrite of Levy et al.
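For readers unfamiliar with the homomorphism test of Chandra and Merlin, a brute-force sketch follows. It covers only the classical case of conjunctive queries without constants, not the generalized tableaux discussed in the talk; the query encoding and the example queries are hypothetical.

```python
from itertools import product

def contained_in(q1, q2):
    """Return True if conjunctive query q1 is contained in q2.

    A query is (head, body): head is a tuple of variables, body is a list
    of atoms (relation_name, args). All terms are variables (no constants).
    q1 is contained in q2 iff some mapping of q2's variables to q1's
    variables sends q2's head to q1's head and each q2 atom to a q1 atom.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    vars2 = sorted({v for _, args in body2 for v in args} | set(head2))
    terms1 = sorted({v for _, args in body1 for v in args} | set(head1))
    atoms1 = {(rel, tuple(args)) for rel, args in body1}
    # Try every assignment of q2's variables to q1's variables.
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        if tuple(h[v] for v in head2) != tuple(head1):
            continue
        if all((rel, tuple(h[v] for v in args)) in atoms1
               for rel, args in body2):
            return True
    return False

# Hypothetical queries over a binary relation E:
#   path2(x, z) :- E(x, y), E(y, z)
#   loose(x, z) :- E(x, u), E(w, z)
path2 = (("x", "z"), [("E", ("x", "y")), ("E", ("y", "z"))])
loose = (("x", "z"), [("E", ("x", "u")), ("E", ("w", "z"))])
print(contained_in(path2, loose), contained_in(loose, path2))
```

Every two-step path satisfies the looser query (map both u and w to y), but not the other way around, so containment holds in one direction only. The search is exponential in the number of variables, which is acceptable for a sketch but not for large queries.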
  • Location: DL Pratt 378
  • Time: 12:30pm (with pizza lunch provided)

March 2, 1999 Tuesday

Title: Distinguished Lecture Series : Automated Verification = Graphs, Automata, and Logic
Speaker: Moshe Vardi, Rice University
Moshe Y. Vardi is a Noah Harding Professor of Computer Science and Chair of Computer Science at Rice University. Prior to joining Rice in 1993, he was at the IBM Almaden Research Center, where he managed the Mathematics and Related Computer Science Department. His research interests include database systems, computational-complexity theory, multi-agent systems, and design specification and verification. Vardi received his Ph.D. from the Hebrew University of Jerusalem in 1981. He is the author and co-author of over 100 technical papers, as well as a book titled "Reasoning about Knowledge". Vardi is the recipient of 3 IBM Outstanding Innovation Awards. He is an editor of several international journals and currently serves as the General Chair of the Federated Logic Conference.
Abstract: In automated verification one uses algorithmic techniques to establish the correctness of the design with respect to a given property. Automated verification is based on a small number of key algorithmic ideas, tying together graph theory, automata theory, and logic. In this self-contained talk I will describe how this "holy trinity" gave rise to automated-verification tools.
  • Location: Sanford Fleming 1105
  • Time: 11:00am
Roundtable: Meet and talk with the speaker at an informal roundtable discussion.
  • Location: DL 378
  • Time: 2:30pm

March 5, 1999

Title: Myths and Realities in Cluster I/O
Speaker: Remzi Arpaci-Dusseau, University of California, Berkeley
Remzi H. Arpaci-Dusseau is currently a graduate student at U.C. Berkeley, under advisor David Patterson. He received a B.S. in Computer Engineering, summa cum laude, from the University of Michigan in 1993, and a Masters in Computer Science from U.C. Berkeley in 1996. He plans to complete his dissertation work in the fall of 1999. His interests lie largely in the area of experimental distributed and parallel systems, including operating systems, file systems, databases, and computer architecture. His most recent work has been on River, a software system designed to provide consistent high performance for cluster applications with large I/O demands. He and his wife, Andrea Arpaci-Dusseau, broke and still hold two world records in external sorting. For more information, see: http://www.cs.berkeley.edu/~remzi
Abstract: In this talk, I will discuss three myths, or popular beliefs, that have grown around clusters of workstations, especially in the arena of high-performance I/O. In discussing the validity of these myths, I will answer three basic questions:
  • What are clusters good for?
  • How should we design clustered workstations?
  • What type of software support is necessary?
I will show empirically that modern clusters are excellent at moving data; the proof of this is NOW-Sort, currently the world-record-holding external sort. I will also explore both the strengths and weaknesses of modern cluster hardware, in particular what it means for a system to be well-balanced. Finally, I will present a new software system called River, which allows cluster applications to be constructed in a robust and straightforward manner.
  • Location: GB 244
  • Time: 11:00am

March 4, 1999

Title: Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems
Speaker: Andrea Arpaci-Dusseau, University of California, Berkeley
Abstract: Building fault-tolerant, scalable services in a distributed system has typically involved complex implementations. We believe that implicit control can greatly simplify the construction of such services. In an implicitly-controlled system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing naturally-occurring local events and their corresponding implicit information, i.e., information available outside of a defined interface.

To concretely demonstrate the advantages of implicit control, we propose implicit coscheduling, an algorithm for dynamically coordinating the time-sharing of communicating processes across distributed machines. Coordinated scheduling, required for communicating processes to leverage the performance benefits of switch-based networks and low overhead protocols, has traditionally been achieved with explicit coscheduling; however, implementations of explicit coscheduling often suffer from multiple failure points and interact poorly with client-server, interactive, and I/O-intensive jobs. With implicit coscheduling, processes in a general-purpose workload can coordinate their own scheduling by simply reacting to implicit information, such as the round-trip time and arrival rate of messages.

In this talk, we describe the two principal components of implicit coscheduling: a fair, preemptive operating system scheduler and conditional two-phase waiting, a generalization of traditional two-phase waiting in which spin-time is increased depending upon events that occur while the process waits. We show through both simulation and an implementation on a cluster of 32 workstations that implicit coscheduling efficiently and fairly handles competing applications with a wide range of communication characteristics.

For relevant papers and more information, see http://now.CS.Berkeley.EDU/Implicit

  • Location: GB 248
  • Time: 11:00am

March 16, 1999

Title: Scalable Decision Tree Construction
Speaker: Johannes Gehrke, University of Wisconsin, Madison
Abstract: Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. Decision trees are a very popular class of classifiers. All current algorithms to construct decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree.

We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any difference with respect to the "real" tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected. The correction step occasionally requires us to make additional scans over subsets of the data; in practice, this situation rarely arises, and can be addressed with little added cost.

Beyond offering faster tree construction, BOAT is the first scalable algorithm with the ability to incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely re-building the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete re-build.

  • Location: SF1105
  • Time: 11:00am
Roundtable: Meet and talk with the speaker at an informal roundtable discussion.
  • Location: DL 266
  • Time: 2:30pm

March 18, 1999

Title: Intelligent Agents: An Application To Adaptive Web-based Systems
Speaker: Bruno Errico, Etnoteam, Italy
Bruno Errico is currently a consultant for Etnoteam S.p.A., a major Italian private company which provides solutions from Information and Communication Technologies, where he manages several projects in the area of Web applications and Operational Support Systems. He graduated in Electronic Engineering in 1992 at the University of Rome "La Sapienza" and received a Ph.D. degree in Computer Science ("Dottorato in Informatica") in 1997 at the same university. His research activity has mainly concerned knowledge representation and Adaptive Interactive Systems. He has addressed the problem of finding minimal models for propositional logics, defining an algorithm for finding prime implicants and determining some complexity results for some approximation versions of the problem. He has worked on problems concerning Adaptive Interactive Systems, i.e., systems that aim at dynamically adapting to the current users. In particular, he worked on the representation of and reasoning about users' mental state based on the current interaction with the system, defining a domain-independent framework to be applied both to User Modeling problems, for interactive systems, and to Student Modeling problems, for Intelligent Tutoring Systems and Intelligent Learning Environments.
Abstract: The talk has three main parts. The first part is devoted to an introduction to the area of Intelligent Agents. The controversial task of giving a formal definition is carried out by providing several taxonomies of current research on Intelligent Agents, along different significant dimensions. In the second part, Intelligent Agents are applied to adaptive Web-based systems. A framework for devising adaptive Web systems, characterized by a smart interface that exploits personalized characters, is introduced and discussed. Finally, in the third part, we relate this framework to the NECTAR system, an ongoing project for a smart interface for on-line shopping.
  • Location: GB 248
  • Time: 10:00am

March 30, 1999

Title: Large Scale Copy Detection
Speaker: Narayanan Shivakumar, Stanford University
Narayanan Shivakumar is a PhD candidate in the Computer Science Department at Stanford University. He received his B.S. degree in Computer Science and Engineering from the University of California, Los Angeles in 1994, and his M.S. degree in Computer Science from Stanford University in 1997. His current research interests include large-scale copy detection algorithms, databases, and digital libraries. He has been a summer visitor at Microsoft Corp., Bell Labs, and Xerox PARC. He is a member of ACM and Tau Beta Pi.
Abstract: Currently, any small-time cyber-pirate can make copies of music CDs and books available on the web in digital format to a large audience at virtually no cost. Content publishers such as Disney and Sony Records are therefore expected to lose several billions of dollars over the next few years in copyright revenues. To address this problem, we propose building a copy detection system (CDS), where content publishers will register their valuable digital content. The CDS then crawls the web, compares the web content to the registered content and notifies the content owners of illegal copies. In my talk, I will discuss how to build such a system so it is accurate, scalable (e.g., to hundreds of gigabytes of data, or millions of web pages) and resilient to "attacks" (e.g., partial audio clips) from cyber-pirates. I will also discuss two prototype CDS I have built as "proofs of concept": (1) SCAM (Stanford Copy Analysis Mechanism), for finding textual copies on the web, and (2) FRAUD (Finding Replicas of AUDio) for finding audio copies on the web.
  • Location: SF 1105
  • Time: 11:00am
Roundtable: Meet and talk with the speaker at an informal roundtable discussion.
  • Location: DL 378
  • Time: 3:00pm

April 5, 1999

Title: Safeguarding Digital Library Contents and Users
Speaker: Henry Gladney, IBM Almaden Research Labs
Abstract: The IBM Research Division reports progress towards world-wide access to digital images of art, ancient artifacts, historic manuscripts, and other materials of world-wide significance. Since 1985, we have been working with collections of artistic and historic materials: of the Biblioteca Apostolica Vaticana (Vatican Library), el Archivo General de Indias (Sevilla, Spain), Andrew Wyeth's work, the Klau Library of Hebrew Union College, and the Yale Beinecke Library. A few illustrations suggest the cultural values which motivate the work, which has been directed towards serving scholars.

More recent work is centered on making North American collections accessible to undergraduates and to the public at large. We pay special attention to the intertwined issues of quality representation and intellectual property rights. Funding collection development is key, together with strict compliance with rightsholders' wishes. The notion that "intellectual property is property" is surprisingly controversial, because some people associate this aphorism with asocial objectives. Collection curators have varying usage policies, from very restrictive to quite permissive, but all want their holdings to be tastefully represented and their Internet presentations to project their institutions favorably.

Our Safeguarding ... series in D-Lib Magazine suggests technology to help manage digital intellectual property. It is broadly accepted that technology can contribute only within a complex of administrative, legal, contractual, and social practices. Among concerns for responsive and responsible management of intellectual property, technical aspects are surely secondary to prominent issues of public policy, law, and ethics. The latter are beginning to be addressed both in legislative processes and by academic investigators.

For the technical community, we assert that we can design offerings with sufficient flexibility. We need not wait for policy decisions which might affect software to administer rules chosen or to hinder unacceptable behavior. In the talk we will project technical directions without designing solutions, emphasizing managing the data -- how it is stored, protected, and communicated.

  • Location: DL Pratt 266
  • Time: 11:00am
Roundtable: Meet and talk with the speaker at an informal roundtable discussion.
  • Location: SF2103
  • Time: 2:00pm

May 10, 1999 Monday

Title: Scaling Heterogeneous Information Access for Wide-Area Environments
Speaker: Louiqa Raschid, University of Maryland
Louiqa Raschid received a Bachelor of Technology in electrical engineering from the Indian Institute of Technology, Madras, in 1980, and a Ph.D. in electrical engineering from the University of Florida, Gainesville, in 1987. Since 1987 she has been at the University of Maryland in College Park. She is an Associate Professor in the Smith School of Business. She also holds a joint appointment with the Institute for Advanced Computer Studies and the Department of Computer Science. Dr. Raschid's research interests include database accessibility over the WWW; query processing with networked information servers; semantic query optimization for object and relational databases; and rule processing in database management systems. She is co-director of the Laboratory for Computational Linguistics and Information Processing. Since 1994, she has been a Visiting Scientist at the French National Laboratories for Information Sciences (INRIA). She has also been a Visiting Scientist with Hewlett Packard Research Labs and Stanford Research Institute. She co-chaired a Working Group, sponsored by the Defense Advanced Research Projects Agency and the National Science Foundation, on mediator data models and query languages, in 1996. Dr. Raschid serves on the editorial board of the INFORMS Journal of Computing. Her research is supported by grants from the National Science Foundation and the Defense Advanced Research Projects Agency. She is a member of IEEE, ACM, and the Society of Women Engineers.
Abstract: Much current research in Information Systems is aimed at providing seamless access to data stored in a wide variety of repositories, including Web-accessible WebSources (enabled by HTTP, XML, HTML). As query processing with such sources is scaled to a wide-area environment such as the Internet, we will encounter significant challenges arising from the huge number of disparate and unreliable repositories and the instability and unreliability of the networks. The scalability problems that must be overcome include: 1) dissimilarities in the capabilities and contents of heterogeneous repositories, which increase the difficulty and expense of generating efficient access plans; 2) the inability to accurately predict response times when accessing remote repositories; and 3) the lack of support for identifying and locating repositories that are relevant to a particular application. We have developed technology to address these problems, including: a toolkit for generating wrappers; a Web Query Optimizer that uses a Wrapper Cost Model and a Web Prediction Tool (WebPT) that predicts response times; Query Scrambling and XJoin, techniques for producing answers quickly in an unpredictable environment; and WebSemantics, a prototype for publishing and locating WebSources using the WWW and XML. In this talk, I will discuss the WebPT and its use in the Web Query Optimizer. This work is in conjunction with several doctoral students, Dr. Vladimir Zadorozhny and Professor Michael Franklin at the University of Maryland, and the WebSemantics project at the University of Toronto.
  • Location: DL Pratt 378
  • Time: 1:00pm NOTE TIME CHANGE

May 14, 1999 Friday

Title: Searching the Web: It's Worse Than You Thought!
Speaker: C. Lee Giles, NEC Research Institute
Dr. C. Lee Giles is a senior research scientist in Computer Science at NEC Research Institute, Princeton, NJ; adjunct faculty at the Institute for Advanced Computer Studies at the U. of Maryland; and adjunct Professor in Computer and Information Science at the U. of Pennsylvania. His research interests are in novel applications of neural and machine learning, agents and AI in web computing and in fundamental models of intelligent systems. He is a Fellow of the IEEE and a member of AAAI, ACM, INNS, OSA, AAAS, and the Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University. Recently, he coauthored a paper published in SCIENCE on the size of the web and search engine coverage that received wide press coverage including the Wall St Journal, NY Times, MSNBC, PBS, BBC, National Geographic, etc. His research was recently highlighted in SIAM News, and he recently taught a graduate class in the Computer and Information Science Dept. at the U. of Pennsylvania on "Information Retrieval, Digital Libraries and the Web."
Abstract: The World Wide Web is a revolution in information dissemination, storage, and access. It has opened up new possibilities in areas such as general and scientific information dissemination and retrieval, commerce and business, education, government, religion, law, entertainment, and health care. There are many avenues for improvement of the Web, for example in the areas of locating and organizing information. We discuss the effectiveness of Web search engines, including results that show that the major Web search engines cover only a fraction of the "publicly indexable Web"[1]. Our current research into improved searching of the Web is discussed, including new techniques for ranking the relevance of results, and new techniques in metasearch that can improve the efficiency and effectiveness of Web search[2]. Time permitting, the creation of digital libraries incorporating autonomous citation indexing is discussed for improved access to scientific information on the Web[3].
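To make the idea of "coverage" concrete, here is a minimal sketch of one standard way to estimate how much of a population two overlapping samples cover: the capture-recapture (Lincoln-Petersen) estimate. The abstract does not describe how the coverage results in [1] were obtained, so this is an illustrative assumption, not the speaker's method; the function names are hypothetical, and the estimate is only meaningful if the two engines index pages roughly independently.

```python
def estimate_total_pages(indexed_a, indexed_b):
    """Capture-recapture estimate of the total number of pages.

    indexed_a, indexed_b: sets of URLs indexed by two engines.
    Assumes (hypothetically) that the engines sample the web independently.
    """
    overlap = len(indexed_a & indexed_b)
    if overlap == 0:
        raise ValueError("no overlap between the two indexes; estimate undefined")
    return len(indexed_a) * len(indexed_b) / overlap

def coverage(indexed, estimated_total):
    """Fraction of the estimated total that one engine has indexed."""
    return len(indexed) / estimated_total
```

With two engines that each index 60 pages and share 20 of them, the estimated total is 60 * 60 / 20 = 180 pages, so each engine covers about one third.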

*This is joint work with Steve Lawrence and Kurt Bollacker.

REFERENCES:

[1] S. Lawrence, C.L. Giles, "Searching the World Wide Web," Science, 280, p. 98, 1998.

[2] S. Lawrence, C.L. Giles, "Context and Page Analysis for Improved Web Search," IEEE Internet Computing, 2(4), pp. 38-46, 1998.

[3] C.L. Giles, K. Bollacker, S. Lawrence, "CiteSeer: An Automatic Citation Indexing System," DL'98 Digital Libraries, The 3rd ACM Conference on Digital Libraries, pp. 89-98, 1998.

  • Location: Galbraith 119
  • Time: 11:00am

May 17, 1999 Monday

Title: Trawling the web for cyber-communities: the Campfire project
Speaker: Sridhar Rajagopalan, IBM Almaden Research Labs
Abstract: The web harbors a large number of communities -- groups of content-creators sharing a common interest -- each of which manifests itself as a set of interlinked web pages. Newsgroups and commercial web directories together contain on the order of 20,000 such communities; our particular interest here is in emerging communities -- those that have little or no representation in such fora. The subject of this talk is the systematic enumeration of over 100,000 such emerging communities from a web crawl: we call our process trawling. We motivate a graph-theoretic approach to locating such communities, and describe the algorithms and the algorithmic engineering necessary to find structures that subscribe to this notion, the challenges in handling such a huge data set, and the results of our experiment.

We also present a probabilistic model for the evolution of the web graph based on our experimental observations. We show that our algorithms run efficiently in this model, and use the model to explain several statistical phenomena on the web that emerged during our experiments.
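As a rough illustration of what a graph-theoretic community signature might look like, the sketch below searches a small link graph for groups of "fan" pages that all link to the same set of "center" pages (a small complete bipartite subgraph). That this is the structure sought is an assumption, since the abstract does not name it; the function find_cores and its parameters i and j are hypothetical, and the exhaustive enumeration shown here would not scale to a real web crawl.

```python
from itertools import combinations

def find_cores(links, i=3, j=2):
    """Find groups of at least i fan pages that all link to the same j centers.

    links: dict mapping each page to the set of pages it links to.
    Returns a list of (fans, centers) pairs as frozensets.
    """
    # Invert the graph: for each target page, record which pages link to it.
    inlinks = {}
    for page, targets in links.items():
        for target in targets:
            inlinks.setdefault(target, set()).add(page)

    cores = []
    # Try every set of j candidate centers (exponential; illustration only).
    for centers in combinations(sorted(inlinks), j):
        # Fans are the pages that link to every one of the chosen centers.
        fans = set.intersection(*(inlinks[c] for c in centers))
        fans -= set(centers)  # a center does not count as its own fan
        if len(fans) >= i:
            cores.append((frozenset(fans), frozenset(centers)))
    return cores
```

For instance, if pages a, b and c each link to both x and y, then find_cores reports a, b and c as fans of the centers x and y, even when none of those pages link to one another directly.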

This is joint work with Ravi Kumar, Prabhakar Raghavan and Andrew Tomkins.

sridhar@almaden.ibm.com

  • Location: GB 119
  • Time: 11:00am