Google
Founded September 1998 by Larry Page and Sergey Brin

PageRank
Named after Larry Page, not after a webpage
Link A->B is a vote by page A for page B
Counts number of votes and their weight
  number - how many votes a page got
  weight - votes from pages with high rank weigh more

Text matching techniques
- advanced analysis
- # of times the search term appears on the page
- content of the page
- content of pages linking to the current page

PageRank is automated - hard to tamper with, one cannot buy a higher PageRank
Google database: ~3 billion URLs
Load: 200 million search requests per day
Provides 75% of external referrals for an average site
Favors pages where search terms are near each other

Query process:
User interacts with Google webserver
* shows user the webpage
* takes the keyword input, formulates query
Webserver sends the query to the Index Servers
* this is the meat of Google - context matching and ranking are done here
* this is a huge, very powerful server cluster - needs to do a lot of calculations and data processing
Index Servers send the results to Document Servers
* these store the cached versions of the pages
* snippets of pages are generated, sent back to webserver for display to user
* webpages are pre-cached by "spider" software

Bad Sides of Google
(http://www.google-watch.org - most of this site is conspiracy theory, but they do bring up some valid points)
* takes about a week for a full web update - too slow for news sites
* sites with high PageRanks are more important, but also less likely to link to new sites - hard to get acceptance for new stuff - this is BY FAR the biggest problem
* problem with deep sites - links go to the front page, but the info is in subpages, which have lower rank and are not as likely to be crawled
* people are reluctant to put in external links, want to keep the crawler on the site longer

Competition: Inktomi (possibly will be used *again* by Yahoo)
Founded in 1996, the result of a research project at UC Berkeley
* allows pay for inclusion
* paid customers get 48 hr refresh, and can write their own summaries
* paid customers can include/exclude themselves from searches from certain regions
* has a crawler
* the rest get reviewed every 10-14 days
* index also about 3 billion pages
* summary of page keyed to search result - not always same as Google's
* indexes full text, but drops common words
* uses *human test* to check effectiveness - this is rather unique
* spider weighs title and description heavily, deep crawls rare
* some sites are visited purely to construct a link map, and are not indexed
* site can be dropped if a different site with similar content seems of higher quality, but likely to remain if many links go to the site
* Best Of Web sites - very popular sites, persistent in index, cannot pay to get here

General Talk about Search Engines
http://www.webreference.com/content/search/
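
Sketch of the PageRank voting idea from the Google notes (a link A->B is a vote, and votes from high-ranked pages weigh more). These notes give no formula, so this is only a rough illustration using the commonly cited damped iteration. The function name `pagerank`, the damping value 0.85, the iteration count, and the toy link graph are all assumptions made for the example, not details of Google's actual system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to.

    damping and iterations are illustrative assumptions, not values
    taken from the notes.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with all pages ranked equally.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from a high-ranked page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: D links out but nobody links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph page D has no incoming links, so it ends up with only the baseline share. That mirrors the "hard to get acceptance for new stuff" problem listed under Bad Sides of Google.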