


| Title: | Intersection Stacking for Multi-dimensional Aggregation in RDBMSs |
| Speaker: |
Roberta Cochrane, IBM Almaden Research Center, San Jose, CA
Roberta Cochrane is a Research Staff Member at the IBM Almaden Research Center in San Jose, California. She received her Ph.D. in Computer Science from the University of Maryland in 1992. Currently, she is involved in research for technologies to support Business Intelligence in relational database management systems. She has also done extensive research in active database systems. She implemented the Starburst Rule System and led the design and implementation of triggers and constraints for both serial and parallel versions of IBM's DB2 Universal Database System. She played a major role in defining the SQL3 standard for triggers and constraints. |
| Abstract: |
Business Intelligence applications perform complex aggregation for
large amounts (typically 1 to 10 Terabytes) of data. This puts
increasing demands on database systems to provide native support for
such processing, often referred to as Online Analytical Processing
(OLAP). SQL has recently extended the group-by clause to provide
primitives for common OLAP computations in the DBMS, allowing the DBMS
more flexibility in processing and optimizing such aggregation. These
computations are the data-cube, roll-up, concatenations of roll-up
(multi-dimensional cube), and combinations of ad-hoc grouping
elements. The specification of the group-by clause can expand into
many grouping sets. For example, the cube alone will result in 2**n
grouping sets where n is the number of grouping elements. In this
talk I will present the SQL OLAP extensions and describe a novel
technique for stacking grouping operations. Our technique results in
linear expansion of grouping sets, greatly reducing the amount of
complexity and resources required to optimize and compute such
queries.
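
The growth in grouping sets mentioned above can be made concrete with a small sketch. It only enumerates the grouping sets that a cube and a roll-up expand into; it does not implement the stacking technique of the talk. The function names and the three example dimensions are hypothetical, chosen for illustration.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """Every subset of the grouping columns: 2**n grouping sets."""
    cols = list(columns)
    return [subset
            for size in range(len(cols), -1, -1)
            for subset in combinations(cols, size)]

def rollup_grouping_sets(columns):
    """Every prefix of the column list: n + 1 grouping sets."""
    cols = tuple(columns)
    return [cols[:size] for size in range(len(cols), -1, -1)]

# Hypothetical dimensions, for illustration only.
dims = ["year", "region", "product"]
cube_sets = cube_grouping_sets(dims)
rollup_sets = rollup_grouping_sets(dims)
print(len(cube_sets), len(rollup_sets))  # 8 4
```

With n = 3 the cube yields 8 grouping sets while the roll-up yields 4, which is the exponential versus linear contrast the abstract refers to.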
|
| Title: | Web Search Technology |
| Speaker: |
Daniel Ford, IBM Almaden Research Center, San Jose, CA
Daniel Ford is the Manager of the Web Technologies Department at the IBM Almaden Research Center in San Jose, California. He received his Ph.D. in Computer Science from the University of Waterloo in 1990. Since 1992 he has been a Research Staff Member at IBM Almaden doing research into Tertiary RAID (RAIL) storage systems, and more recently in Web Search technologies. |
| Abstract: |
Grand Central Station is a system that extends search to all digital
sources of data. The system consists of components to access and
understand data sources and generate searchable metadata
("Gatherers"), a metadata repository for satisfying ad hoc queries,
and an extensible profiling system for processing persistent queries.
The Gatherer is an extensible crawler framework written in Java that
is capable of using a variety of protocols (e.g., http, ftp, nntp,
odbc, cics, pop3) to access and understand a wide range of data
formats (HTML, Java Bytecode, PowerPoint, TAR/Zip archives, and many
others). The Gatherer generates summaries of each data source it
encounters in an instance of XML we call SumML (Summary Metalanguage).
Key features of the Gatherer are its ability to be easily extended by
adding protocol and data source specific code, and its ability to run,
unchanged, on any platform that supports Java.
The metadata repository is less advanced, but the Profiling framework
is moving forward to encompass multimedia profiling.
The system has been deployed in the form of a Java-specific search
engine called "jCentral" and is accessible from IBM's Java home page
(http://www.ibm.com/java).
This talk will present and demonstrate the system, and discuss future
directions into searching video, image and audio.
|
| Title: | Research Projects in Data Warehousing, Data Mining and Heterogeneous DBMS |
| Speaker: |
Renée J. Miller, University of Toronto
|
| Abstract: |
A common requirement among many of today's data-intensive applications
is the need to efficiently manage and analyze large volumes of
semi-structured, heterogeneous data. It is this task that lies at the
heart of my research agenda. Of particular interest are the tasks
required to support data warehousing. Data warehouses provide
integrated access to historical data collected from legacy data
sources for use in decision support and data analysis. Data
warehousing uses techniques developed in a variety of areas, including
heterogeneous DBMS (the management and integration of
heterogeneous schemas and data); database publishing (the
browsing and querying of complex, structured or semi-structured
databases by users who are unfamiliar with the data and its
organization); and data mining (the search for patterns in data
that are useful in data analysis and data reduction). I will give an
overview of my past and ongoing work in some of these areas.
|
| Title: | Hierarchical Data Management |
| Speaker: |
H.V. Jagadish, AT&T Labs
|
| Abstract: |
Much of the data we deal with every day is organized hierarchically: file
systems, library classification schemes and yellow page categories are
salient examples. Yet, commercial databases use a flat relational model
that does not conveniently accommodate hierarchies. The question we seek
to address is whether it is possible to combine the benefits of a
hierarchical representation of data (better support for heterogeneity,
autonomy, scaling) with the strong points of the relational data model
(clean data model, declarative query language).
We do so with particular focus on directory servers and in particular LDAP (lightweight directory access protocol). We develop a hierarchical data model based on LDAP, and explore some possibilities for a query language.
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable discussion.
|
| Title: | Exploratory Association Rule Mining with Constraints |
| Speaker: |
Raymond Ng, Univ. of British Columbia
|
| Abstract: |
From the standpoint of supporting human-centered discovery of knowledge,
the present-day model of mining association rules suffers from the
following serious shortcomings: (i) lack of user exploration and control,
(ii) lack of focus, and (iii) rigid notion of relationships. In effect,
this model functions as a black-box, admitting little user interaction
in between. We propose, in this talk, an architecture that opens up
the black-box, and supports constraint-based, human-centered exploratory
mining of associations. The foundation of this architecture is a rich set
of constraint constructs, including domain, class, and SQL-style aggregate
constraints, which enable users to clearly specify what associations are
to be mined. We propose constrained association queries as a means
of specifying the constraints to be satisfied by the antecedent and
consequent of a mined association.
In this talk, we mainly focus on the technical challenges in guaranteeing a level of performance that is commensurate with the selectivities of the constraints in an association query. To this end, we introduce and analyze two properties of constraints that are critical to pruning: anti-monotonicity and succinctness. We then develop characterizations of various constraints into four categories, according to these properties. Finally, we describe a mining algorithm called CAP, which achieves a maximized degree of pruning for all categories of constraints. Experimental results indicate that CAP can run much faster, in some cases as much as 80 times, than several basic algorithms. This demonstrates how important the succinctness and anti-monotonicity properties are in delivering the performance guarantee. To conclude, we discuss the implications of this work on the issue of how to integrate association rule mining with DBMS.
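
To illustrate why anti-monotonicity helps pruning, here is a minimal sketch of a level-wise frequent-itemset search that drops any candidate violating an anti-monotone constraint before its support is counted. It is not the CAP algorithm, and succinctness is not covered. The item prices, the budget of 6, and the transactions are made-up illustration data.

```python
def frequent_itemsets(transactions, min_support, constraint):
    """Level-wise search that extends only itemsets passing both checks.

    `constraint` is assumed to be anti-monotone: if a set fails it, every
    superset fails it too, so a failing set never needs to be extended.
    """
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    result = {}
    while level:
        survivors = []
        for candidate in level:
            if not constraint(candidate):
                continue  # pruned by the constraint, support never counted
            count = support(candidate)
            if count >= min_support:
                result[candidate] = count
                survivors.append(candidate)
        # Candidates one item larger, built by joining pairs of survivors.
        next_level = set()
        for a in survivors:
            for b in survivors:
                union = a | b
                if len(union) == len(a) + 1:
                    next_level.add(union)
        level = sorted(next_level, key=sorted)
    return result

# Hypothetical data: total price of an itemset must not exceed 6.
prices = {"bread": 2, "milk": 1, "wine": 9, "cheese": 5}
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "cheese"}),
    frozenset({"milk", "cheese", "wine"}),
    frozenset({"bread", "milk", "wine"}),
]
within_budget = lambda s: sum(prices[i] for i in s) <= 6
result = frequent_itemsets(transactions, min_support=2, constraint=within_budget)
```

Because prices are non-negative, the budget constraint is anti-monotone: once `{"wine"}` fails, no set containing wine is ever generated.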
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable lunch
discussion. Pizza will be served.
|
| Title: | DISIMA Project: A Distributed, Interoperable Image Database System |
| Speaker: |
M. Tamer Özsu, Univ. of Alberta
|
| Abstract: |
This talk describes the DISIMA project currently under investigation at the
Laboratory for Database Systems Research of the University of Alberta.
The DISIMA project aims at building an image database system enabling
content-based querying. The main emphasis of the work is content-based
indexing and querying of images, in particular with respect to spatial
relationships. The identifying characteristics of our project are (a)
object-oriented approach to image data management, (b) use of image
processing and indexing techniques for efficient querying and access to
image databases, and (c) interoperability among various image storage
systems. The DISIMA model allows the user to assign different semantics to an
image component (semantic independence) and an image representation can be
changed without any effect on applications using it (representation
independence). The architecture involves different image sources including
WWW-servers and file systems. The system provides an OQL-based multimedia
query language for accessing images (and video). DISIMA is currently being
implemented on top of ObjectStore. The talk will focus on system
architecture, image modeling, type system design, query language and
indexing issues.
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable
discussion.
|
| Title: | Distinguished Lecture Series: Array Databases |
| Speaker: |
Ken Salem, University of Waterloo
Ken Salem is an Associate Professor at the University of Waterloo and director of the Distributed Batch Controller Project. |
| Abstract: |
Arrays are an appropriate data model for images, gridded
output from computational models, and other types of data.
This talk will describe an array database system based on a
multidimensional data model and a simple query algebra called AML.
To write expressions in AML, it is first necessary to supply a set of
domain-specific functions. AML can then be used to apply
those functions to arrays in a structured way. Examples will be
used to illustrate some useful things that can be written in AML,
and to show how AML expressions can be treated declaratively and
optimized by a database system.
|
| Title: | Content-Based Organization of the Information Space in Multi-Database Networks |
| Speaker: |
Mike Papazoglou, Tilburg Univ.
Professor Papazoglou is the director of INFOLAB at Tilburg University in the Netherlands. His interests include information systems, data modelling, distributed information services, and agent-based systems. He has recently edited a book on Cooperative Information Systems and another one on Object-Oriented Data Modelling. |
| Abstract: |
Rapid growth in the volume of network-available data, complexity,
diversity and terminological fluctuations, at different data sources,
render network-accessible information increasingly difficult to
achieve. The situation is particularly cumbersome for users of
multi-database systems who are expected to have prior detailed
knowledge of the definition and uses of the information content in
these systems.
This talk will describe a conceptual organization of the information space across collections of component systems in multi-databases that provides serendipity, exploration and contextualization support. In this way users can achieve logical connections between concepts they are familiar with and schema terms employed in multi-database systems. Large-scale searching for multi-database schema information is guided by a combination of lexical, structural and semantic aspects of schema terms in order to reveal more meaning both about the contents of a requested information term and about its placement within the distributed information space.
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable
discussion.
|
| Title: | DATABASE SYSTEMS: Trends in Database System Development |
| Speaker: |
Vidojko Ciric, University of Belgrade
Dr. Vidojko Ciric is a professor of computer science at the University of Belgrade. He received a Ph.D. degree in 1969 from Rice University, EE Department, Houston, Texas, USA. He was granted a research scholarship from NSF & NASA and had an active part in the Apollo 8 flight simulation project at the University. His research interests include software engineering, OO programming, database systems and CASE tool design. His publication list includes more than 160 refereed journal or conference papers and 20 books in system and computer science. According to a survey related to the Gutenberg anniversary, the book "General Sensitivity Theory" by Tomovic and Vukobratovic, published by American Elsevier in 1973, for which Prof. Ciric wrote the chapter "A New Controllability Concept in Sensitivity Design of Optimal Control Systems", belongs to the group of 200 most referenced technical books in engineering. He was a referee for IFAC Automatica and IEEE Computer Society. Currently, he is an active referee and a member of the program committees of two IASTED conferences: Software Engineering and Applied Informatics. He organized and chaired a special session, Issues in OO Design and Programming, at the IASTED SE'98 Conference in Las Vegas. |
| Details: |
|
| Title: | Index Structures for Path Expressions |
| Speaker: |
Dan Suciu, AT&T Research
Dan Suciu is a member of the technical staff at AT&T Labs. He is working on semistructured data, and has been involved in several projects on semistructured data (UnQL, Strudel, XML-QL). He received his PhD from the University of Pennsylvania in 1995, and his BS from the Polytechnic of Bucharest (Romania). |
| Abstract: |
Queries over semistructured databases contain regular path
expressions. Their naive evaluation is prohibitively expensive, since in
most cases it requires the traversal of the entire database. We
propose a novel and general index structure for computing regular path
expressions on semistructured data, called T-index, whose main
features are: (1) it can trade off space for generality, (2) it can
always be computed efficiently (in PTIME over the database), and (3) it
is provably space efficient. T-indexes generalize several known index
structures, such as: data-guides (for semistructured data), Access
Support Relations (for OODBs), and Pat trees (for full text indexes).
Joint work with Tova Milo (Tel Aviv University)
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable
discussion.
|
| Title: | A Deterministic Model for Semi-Structured Data |
| Speaker: |
Peter Buneman, University of Pennsylvania
|
| Abstract: |
This is a preliminary report on a new model for semi-structured data.
The idea of semi-structured data evolved, in part, from various
syntactic representations of data such as AceDB, OEM and various data
formats and it has more recently been used effectively to design query
languages for XML. Insofar as there is an agreed model, it is simply
an edge-labeled graph. However this description begs a number of
important questions: What can constitute an edge label? Are there
values associated with the vertices? Is there a separate labeling
system for vertices to provide them with independent identity?
We describe here a new model for semi-structured data. It is more
restrictive than the recently described models in that it is
deterministic. The edges emanating from any node in the graph have
distinct labels. It is less restrictive in that the edges can carry
data and may have structure. In fact they may themselves be small
pieces of semi-structured data. The advantage of this approach is
that each component of the database is uniquely identified by a path.
Paths serve as object identifiers or l-values; but unlike object
identifiers, paths also have structure, and a number of useful
database operations may be obtained by manipulation of this structure.
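
As a rough illustration of the idea that a path uniquely identifies each component, the sketch below models a deterministic edge-labeled tree as nested Python dictionaries, whose keys are distinct by construction. Tuples stand in for structured edge labels. The sample database, its labels, and the helper names are hypothetical and are not taken from the model described in the talk.

```python
def resolve(tree, path):
    """Follow a sequence of edge labels from the root to one component."""
    node = tree
    for label in path:
        node = node[label]
    return node

def all_paths(tree, prefix=()):
    """Yield (path, leaf value) pairs; each path identifies exactly one leaf."""
    if isinstance(tree, dict):
        for label, child in tree.items():
            yield from all_paths(child, prefix + (label,))
    else:
        yield prefix, tree

# Hypothetical database: tuple keys play the role of edge labels that
# carry data and have structure of their own.
db = {
    "gene": {
        ("id", 17): {"name": "abc1", "note": "checked"},
        ("id", 42): {"name": "xyz2"},
    }
}

name = resolve(db, ("gene", ("id", 17), "name"))
paths = dict(all_paths(db))
```

Because labels leaving a node are distinct, `resolve` never has to choose between edges, and a path can be sliced or extended like any other tuple.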
The motivation for this model came in part from the need to develop
an annotation system for "curated" databases. While the databases
typically have a well-defined and rich structure, annotations are
arbitrary and unpredictable, and they require some form of
semi-structured approach.
Work with Alin Deutsch and Wang-Chiew Tan of the University of
Pennsylvania.
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable
discussion.
|
| Title: | Tableau Techniques for Querying Information Sources Through Global Schemas |
| Speaker: |
Gösta Grahne, Concordia University
|
| Abstract: |
The foundational homomorphism techniques introduced by Chandra and Merlin
for testing containment of conjunctive queries have recently attracted
renewed interest due to their central role in information integration
applications. We show that generalizations of the classical tableau
representation of conjunctive queries are useful for computing query
answers in information integration systems where information sources
are modeled as views defined on a virtual global schema. We consider
a general situation where sources may or may not be known to be
correct and complete. We characterize the set of answers to a global
query and give algorithms to compute a finite representation of this
possibly infinite set, as well as its certain and possible
approximations. We show how to rewrite a global query in terms of the
sources in two special cases, and show that one of these is equivalent
to the Information Manifold rewrite of Levy et al.
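
For readers unfamiliar with the classical homomorphism test, the following sketch checks containment of two conjunctive queries by searching for a homomorphism from one query to the other. It covers only the textbook case with variables and no constants, not the generalized tableaux of the talk. The query encoding (a head tuple plus a list of `(relation, arguments)` atoms) is an assumption made for this example.

```python
def contained_in(q1, q2):
    """Return True if q1 is contained in q2.

    By the classical result, this holds exactly when some homomorphism maps
    q2's head to q1's head and every body atom of q2 to a body atom of q1.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    mapping = {}
    for v2, v1 in zip(head2, head1):
        if mapping.setdefault(v2, v1) != v1:
            return False

    def extend(i, mapping):
        # Try to map the i-th atom of q2 onto some atom of q1.
        if i == len(body2):
            return True
        rel, args = body2[i]
        for rel1, args1 in body1:
            if rel1 != rel or len(args1) != len(args):
                continue
            new = dict(mapping)
            if all(new.setdefault(a, b) == b for a, b in zip(args, args1)):
                if extend(i + 1, new):
                    return True
        return False

    return extend(0, mapping)

# Q1(x) :- E(x, y), E(y, x)      Q2(x) :- E(x, y)
q1 = (("x",), [("E", ("x", "y")), ("E", ("y", "x"))])
q2 = (("x",), [("E", ("x", "y"))])
print(contained_in(q1, q2), contained_in(q2, q1))  # True False
```

The search is exponential in the worst case, which is expected since conjunctive query containment is NP-complete.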
|
| Title: | Distinguished Lecture Series : Automated Verification = Graphs, Automata, and Logic |
| Speaker: |
Moshe Vardi, Rice University
Moshe Y. Vardi is a Noah Harding Professor of Computer Science and Chair of Computer Science at Rice University. Prior to joining Rice in 1993, he was at the IBM Almaden Research Center, where he managed the Mathematics and Related Computer Science Department. His research interests include database systems, computational-complexity theory, multi-agent systems, and design specification and verification. Vardi received his Ph.D. from the Hebrew University of Jerusalem in 1981. He is the author and co-author of over 100 technical papers, as well as a book titled "Reasoning about Knowledge". Vardi is the recipient of 3 IBM Outstanding Innovation Awards. He is an editor of several international journals and currently serves as the General Chair of the Federated Logic Conference. |
| Abstract: |
In automated verification one uses algorithmic techniques to
establish the correctness of the design with respect to a given
property. Automated verification is based on a small number of
key algorithmic ideas, tying together graph theory, automata
theory, and logic. In this self-contained talk I will describe how
this "holy trinity" gave rise to automated-verification tools.
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable
discussion.
|
| Title: | Myths and Realities in Cluster I/O |
| Speaker: |
Remzi Arpaci-Dusseau, University of California, Berkeley
Remzi H. Arpaci-Dusseau is currently a graduate student at U.C. Berkeley, under advisor David Patterson. He received a B.S. in Computer Engineering, summa cum laude, from the University of Michigan in 1993, and a Masters in Computer Science from U.C. Berkeley in 1996. He plans to complete his dissertation work in the fall of 1999. His interests lie largely in the area of experimental distributed and parallel systems, including operating systems, file systems, databases, and computer architecture. His most recent work has been on River, a software system designed to provide consistent high performance for cluster applications with large I/O demands. He and his wife, Andrea Arpaci-Dusseau, broke and still hold two world records in external sorting. For more information, see: http://www.cs.berkeley.edu/~remzi |
| Abstract: |
In this talk, I will discuss three myths, or popular beliefs, that
have grown around clusters of workstations, especially in the arena of
high-performance I/O. In discussing the validity of these myths, I
will answer three basic questions:
|
| Title: | Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems |
| Speaker: |
Andrea Arpaci-Dusseau, University of California, Berkeley
|
| Abstract: |
Building fault-tolerant, scalable services in a distributed system
has typically involved complex implementations. We believe that
implicit control can greatly simplify the construction of such
services. In an implicitly-controlled system, cooperating
components do not explicitly contact other components for control
or state information; instead, components infer remote state by
observing naturally-occurring local events and their corresponding
implicit information, i.e., information available outside of a
defined interface.
To concretely demonstrate the advantages of implicit control, we propose implicit coscheduling, an algorithm for dynamically coordinating the time-sharing of communicating processes across distributed machines. Coordinated scheduling, required for communicating processes to leverage the performance benefits of switch-based networks and low overhead protocols, has traditionally been achieved with explicit coscheduling; however, implementations of explicit coscheduling often suffer from multiple failure points and interact poorly with client-server, interactive, and I/O-intensive jobs. With implicit coscheduling, processes in a general-purpose workload can coordinate their own scheduling by simply reacting to implicit information, such as the round-trip time and arrival rate of messages. In this talk, we describe the two principal components of implicit coscheduling: a fair, preemptive operating system scheduler and conditional two-phase waiting, a generalization of traditional two-phase waiting in which spin-time is increased depending upon events that occur while the process waits. We show through both simulation and an implementation on a cluster of 32 workstations that implicit coscheduling efficiently and fairly handles competing applications with a wide range of communication characteristics. For relevant papers and more information, see http://now.CS.Berkeley.EDU/Implicit
|
| Title: | Scalable Decision Tree Construction |
| Speaker: |
Johannes Gehrke, University of Wisconsin, Madison
|
| Abstract: |
Classification is an important data mining problem. Given a training
database of records, each tagged with a class label, the goal of
classification is to build a concise model that can be used to predict
the class label of future, unlabeled records. Decision trees are a very
popular class of classifiers. All current algorithms to construct
decision trees, including all main-memory algorithms, make one scan
over the training database per level of the tree.
We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any difference with respect to the "real" tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected. The correction step occasionally requires us to make additional scans over subsets of the data; in practice, this situation rarely arises, and can be addressed with little added cost. Beyond offering faster tree construction, BOAT is the first scalable algorithm with the ability to incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely re-building the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete re-build.
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable
discussion.
|
| Title: | Intelligent Agents: An Application To Adaptive Web-based Systems |
| Speaker: |
Bruno Errico, Etnoteam, Italy
Bruno Errico is currently a consultant for Etnoteam S.p.A., a major Italian private company which provides solutions from Information and Communication Technologies, where he manages several projects in the area of Web applications and Operational Support Systems. He graduated in Electronic Engineering in 1992 at the University of Rome "La Sapienza" and got a Ph.D. degree in Computer Science ("Dottorato in Informatica") in 1997 at the same University. His research activity has mainly concerned knowledge representation and Adaptive Interactive Systems. He has addressed the problem of finding minimal models for propositional logics, defining an algorithm for finding prime implicants and determining some complexity results for some approximation versions of the problem. He has worked on problems concerning Adaptive Interactive Systems, i.e., systems that aim at dynamically adapting to the current users. In particular, he worked on the representation of and reasoning about users' mental state based on the current interaction with the system, defining a domain-independent framework to be applied both to User Modeling problems, for interactive systems, and to Student Modeling problems, for Intelligent Tutoring Systems and Intelligent Learning Environments. |
| Abstract: |
The talk is centered around three main parts.
The first part is devoted to an introduction to the area of
Intelligent Agents. The controversial task of giving a formal definition
is carried out by providing several taxonomies of current research on
Intelligent Agents, along different significant dimensions.
In the second part, Intelligent Agents are applied to adaptive Web-based
systems. A framework for devising adaptive Web systems, characterized
by a smart interface that exploits personalized characters, is
introduced and discussed.
Finally, in the third part, we relate this framework to the NECTAR
system, an ongoing project for a smart interface for on-line shopping.
|
| Title: | Large Scale Copy Detection |
| Speaker: |
Narayanan Shivakumar, Stanford University
NARAYANAN SHIVAKUMAR is a PhD candidate in the Computer Science Department at Stanford University. He received his B.S. degree in Computer Science and Engineering from University of California, Los Angeles in 1994, and his M.S. degree in Computer Science from Stanford University in 1997. His current research interests include large-scale copy detection algorithms, databases, and digital libraries. He has been a summer visitor at Microsoft Corp., Bell Labs, and Xerox PARC. He is a member of ACM and Tau Beta Pi. |
| Abstract: |
Currently, any small-time cyber-pirate can make copies of music CDs and
books available on the web in digital format to a large audience at
virtually no cost. Content publishers such as Disney and Sony Records
are therefore expected to lose several billions of dollars over the next
few years in copyright revenues. To address this problem, we propose
building a copy detection system (CDS), where content publishers will
register their valuable digital content. The CDS then crawls the web,
compares the web content to the registered content and notifies the
content owners of illegal copies. In my talk, I will discuss how to
build such a system so it is accurate, scalable (e.g., to hundreds of
gigabytes of data, or millions of web pages) and resilient to
"attacks" (e.g., partial audio clips) from cyber-pirates. I will also
discuss two prototype CDS I have built as "proofs of concept":
(1) SCAM (Stanford Copy Analysis Mechanism), for finding textual copies
on the web, and (2) FRAUD (Finding Replicas of AUDio) for finding audio
copies on the web.
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable
discussion.
|
| Title: | Safeguarding Digital Library Contents and Users |
| Speaker: |
Henry Gladney, IBM Almaden Research Labs
|
| Abstract: |
The IBM Research Division reports progress towards world-wide access
to digital images of art, ancient artifacts, historic manuscripts, and
other materials of world-wide significance. Since 1985, we have been
working with collections of artistic and historic materials: of the
Biblioteca Vaticana Apostolica (Vatican Library), el Archivo General
de Indias (Sevilla, Spain), Andrew Wyeth's work, the Klau Library of
Hebrew Union College, and the Yale Beinecke Library. A few
illustrations suggest the cultural values which motivate the work,
which has been directed towards serving scholars.
More recent work is centered on making North American collections accessible to undergraduates and to the public at large. We pay special attention to the intertwined issues of quality representation and intellectual property rights. Funding collection development is key, together with strict compliance with rightsholders' wishes. The notion that "intellectual property is property" is surprisingly controversial, because some people associate this aphorism with asocial objectives. Collection curators have varying usage policies, from very restrictive to quite permissive, but all want their holdings to be tastefully represented and their Internet presentations to project their institutions favorably. Our Safeguarding ... series in D-Lib Magazine suggests technology to help manage digital intellectual property. That technology can contribute only in a complex of administrative, legal, contractual, and social practices is broadly accepted. Among concerns for responsive and responsible management of intellectual property, technical aspects are surely secondary to prominent issues of public policy, law, and ethics. The latter are beginning to be addressed both in legislative processes and also by academic investigators. For the technical community, we assert that we can design offerings with sufficient flexibility. We need not wait for policy decisions which might affect software to administer rules chosen or to hinder unacceptable behavior. In the talk we will project technical directions without designing solutions, emphasizing managing the data -- how it is stored, protected, and communicated.
|
| Roundtable: |
Meet and talk with the speaker at an informal roundtable
discussion.
|
| Title: | Scaling Heterogeneous Information Access for Wide-Area Environments |
| Speaker: |
Louiqa Raschid, University of Maryland
Louiqa Raschid received a Bachelor of Technology in electrical engineering from the Indian Institute of Technology, Madras, in 1980, and a Ph.D. in electrical engineering from the University of Florida, Gainesville, in 1987. Since 1987 she has been at the University of Maryland in College Park. She is an Associate Professor in the Smith School of Business. She also holds a joint appointment with the Institute for Advanced Computer Studies and the Department of Computer Science. Dr. Raschid's research interests include database accessibility over the WWW; query processing with networked information servers; semantic query optimization for object and relational databases; and rule processing in database management systems. She is co-director of the Laboratory for Computational Linguistics and Information Processing. Since 1994, she has been a Visiting Scientist at the French National Laboratories for Information Sciences (INRIA). She has also been a Visiting Scientist with Hewlett Packard Research Labs and Stanford Research Institute. She co-chaired a Working Group, sponsored by the Defense Advanced Research Projects Agency and the National Science Foundation, on mediator data models and query languages, in 1996. Dr. Raschid serves on the editorial board of the INFORMS Journal of Computing. Her research is supported by grants from the National Science Foundation and the Defense Advanced Research Projects Agency. She is a member of IEEE, ACM, and the Society of Women Engineers. |
| Abstract: |
Much current research in Information Systems is aimed at providing
seamless access to data stored in a wide variety of repositories
including Web-accessible WebSources (enabled by HTTP, XML, HTML). As
query processing with such sources is scaled to a wide-area
environment such as the Internet, we will encounter significant
challenges arising from the huge number of disparate and unreliable
repositories and the instability and unreliability of the networks.
The scalability problems that must be overcome include:
1) dissimilarities in the capabilities and contents of heterogeneous
repositories, which increase the difficulty and expense of generating
efficient access plans; 2) the inability to accurately predict
response times when accessing remote repositories; and 3) the lack of
support for identifying and locating repositories that are relevant to
a particular application.
We have developed technology to address these problems, including: a
toolkit for generating wrappers; a Web Query Optimizer that uses a
Wrapper Cost Model and a Web Prediction Tool (WebPT) that predicts
response times; Query Scrambling and XJoin, techniques for producing
answers quickly in an unpredictable environment; and WebSemantics, a
prototype for publishing and locating WebSources using the WWW and
XML. In this talk, I will discuss the WebPT and its use in the Web
Query Optimizer.
This work is in conjunction with several doctoral students,
Dr. Vladimir Zadorozhny and Professor Michael Franklin at the
University of Maryland, and the WebSemantics project
at the University of Toronto.
|
| Title: | Searching the Web: It's Worse Than You Thought! |
| Speaker: |
C. Lee Giles, NEC Research Institute
Dr. C. Lee Giles is a senior research scientist in Computer Science at NEC Research Institute, Princeton, NJ; adjunct faculty at the Institute for Advanced Computer Studies at the U. of Maryland; and adjunct Professor in Computer and Information Science at the U. of Pennsylvania. His research interests are in novel applications of neural and machine learning, agents and AI in web computing and in fundamental models of intelligent systems. He is a Fellow of the IEEE and a member of AAAI, ACM, INNS, OSA, AAAS, and the Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University. Recently, he coauthored a paper published in SCIENCE on the size of the web and search engine coverage that received wide press coverage including the Wall St Journal, NY Times, MSNBC, PBS, BBC, National Geographic, etc. His research was recently highlighted in SIAM News, and he recently taught a graduate class in the Computer and Information Science Dept. at the U. of Pennsylvania on "Information Retrieval, Digital Libraries and the Web." |
| Abstract: |
The World Wide Web is a revolution in information dissemination,
storage, and access. It has opened up new possibilities in areas
such as general and scientific information dissemination and
retrieval, commerce and business, education, government, religion,
law, entertainment, and health care. There are many avenues for
improvement of the Web, for example in the areas of locating and
organizing information. We discuss the effectiveness of Web search
engines, including results that show that the major Web search engines
cover only a fraction of the ``publicly indexable Web''[1]. Our
current research into improved searching of the Web is discussed,
including new techniques for ranking the relevance of results, and new
techniques in metasearch that can improve the efficiency and
effectiveness of Web search[2]. Time permitting, the creation of digital
libraries incorporating autonomous citation indexing is discussed for
improved access to scientific information on the Web[3].
*This is joint work with Steve Lawrence and Kurt Bollacker.
REFERENCES:
[1] S. Lawrence, C.L. Giles, "Searching the World Wide Web," SCIENCE, 280, p. 98, 1998.
[2] S. Lawrence, C.L. Giles, "Context and Page Analysis for Improved Web Search," IEEE Internet Computing, 2(4), pp. 38-46, 1998.
[3] C.L. Giles, K. Bollacker, S. Lawrence, "CiteSeer: An Automatic Citation Indexing System," DL'98 Digital Libraries, The 3rd ACM Conference on Digital Libraries, pp. 89-98, 1998.
|
| Title: | Trawling the web for cyber-communities: the Campfire project |
| Speaker: |
Sridhar Rajagopalan, IBM Almaden Research Labs
|
| Abstract: | The
web harbors a large number of communities -- groups of
content-creators sharing a common interest -- each of which manifests
itself as a set of interlinked web pages. Newsgroups and commercial
web directories together contain on the order of 20,000 such
communities; our particular interest here is in emerging communities
-- those that have little or no representation in such fora. The
subject of this talk is the systematic enumeration of over 100,000
such emerging communities from a web crawl: we call our process
trawling. We motivate a graph-theoretic approach to locating such
communities, and describe the algorithms, and the algorithmic
engineering necessary to find structures that subscribe to this
notion, the challenges in handling such a huge data set, and the
results of our experiment.
We also present a probabilistic model for the evolution of the web graph based on our experimental observations. We show that our algorithms run efficiently in this model, and use the model to explain several statistical phenomena on the web that emerged during our experiments. This is joint work with Ravi Kumar, Prabhakar Raghavan and Andrew Tomkins. sridhar@almaden.ibm.com
|