Hector Garcia-Molina

Stanford University

How to Crawl the Web

Abstract

A crawler collects large numbers of web pages, to be used for building an index or for data mining. Crawlers consume significant network and computing resources, both at the visited web servers and at the site(s) collecting the pages, and thus it is critical to make them efficient and well behaved. In this talk I will discuss how to build a "good" crawler, addressing questions such as:

How can a crawler gather "important" pages only?
How can a crawler efficiently keep its collection "fresh"?
How can a crawler be parallelized?

I will also summarize results from an experiment conducted on more than half a million web pages over 4 months, to estimate how web pages evolve over time.

Joint work with Junghoo Cho

Biography

Hector Garcia-Molina is the Leonard Bosack and Sandra Lerner Professor in the Departments of Computer Science and Electrical Engineering at Stanford University, Stanford, California. From August 1994 to December 1997 he was the Director of the Computer Systems Laboratory at Stanford. From 1979 to 1991 he was on the faculty of the Computer Science Department at Princeton University, Princeton, New Jersey. His research interests include distributed computing systems and database systems. He received a BS in electrical engineering from the Instituto Tecnologico de Monterrey, Mexico, in 1974. From Stanford University, he received an MS in electrical engineering in 1975 and a PhD in computer science in 1979. Garcia-Molina is a Fellow of the ACM, received the 1999 ACM SIGMOD Innovations Award, and is a member of the President's Information Technology Advisory Committee (PITAC).

Host: contact Prof. Mendelzon for information on the speaker's schedule.

Time and Location: return to the 2000 Colloquia Series main page.

University of Toronto
Department of Computer Science

A Distinguished Lecture on Computer Science