The Web contains hundreds of millions of pages covering an amazing variety of topics, so retrieving useful information from it is a daunting task. How do we obtain the required information from those millions of pages? Internet search engine sites such as google.com, yahoo.com and live.com are, for most of us, the only practical option. These are special sites on the Web designed to help people find information stored on other sites. At first glance it seems like magic: the site appears to understand what we intended to search for. Search engines come in two main varieties: crawler-based search engines and human-powered directories. Crawler-based search engines create their listings automatically and track changes on web pages by themselves, whereas a human-powered directory depends on humans for its listings. In a rapidly growing Web, the crawler-based approach is the better one.
Crawler-based search engines have three major steps.
a) Crawling
b) Indexing
c) Searching
Crawling:
Web crawlers (also called spiders) are programs that locate and gather information on the Web. They recursively follow the hyperlinks present in known documents to find other documents. The usual starting points are lists of heavily used servers and very popular pages. In this way the spider quickly spreads out across the most widely used portions of the Web. It then revisits each site on a regular basis, for example every month or two, to look for changes.
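To make this concrete, here is a minimal breadth-first crawler sketch in Python. The seed URL, page limit, politeness delay and the regex used to pull out links are illustrative choices of mine, not part of any particular engine:

# A minimal breadth-first crawler sketch. Seed URLs, the politeness delay
# and the page limit are illustrative values, not taken from the article.
import re
import time
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=50, delay=1.0):
    """Visit pages breadth-first, following hyperlinks found in known documents."""
    frontier = deque(seed_urls)          # URLs waiting to be fetched
    visited = set()                      # URLs already fetched
    pages = {}                           # URL -> raw HTML

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                     # a real crawler would log and retry
        pages[url] = html

        # Extract absolute links with a simple regex; a production crawler
        # would use a proper HTML parser and resolve relative URLs as well.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in visited:
                frontier.append(link)

        time.sleep(delay)                # be polite to the servers we visit
    return pages

if __name__ == "__main__":
    fetched = crawl(["https://example.com/"])
    print(f"Fetched {len(fetched)} pages")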
Indexing:
An index, also known as a catalog, helps find information as quickly as possible. If a web page changes, the index is updated with the new information. Indexing basically consists of two steps:
a) Parsing
b) Hashing
a) Parsing:
The parser extracts the links needed for further crawling. It also removes tags, JavaScript, comments and so on from the web pages and converts each HTML document to plain text. Regular expressions are used extensively for this automated analysis of the text. A parser designed to run on the entire Web must handle a huge array of possible errors.
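A small sketch of such a parser, using only Python's standard html.parser module. The class name PageParser and its exact behaviour are my own illustration, not the parser of any real engine:

# Strips tags, <script>/<style> content and comments, collects the visible
# text, and records the hyperlinks for further crawling.
from html.parser import HTMLParser

class PageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []            # hrefs found on the page
        self.text_parts = []       # visible text fragments
        self._skip = 0             # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

    @property
    def text(self):
        return " ".join(self.text_parts)

if __name__ == "__main__":
    parser = PageParser()
    parser.feed('<html><body><script>var x=1;</script>'
                '<p>Hello <a href="/next">world</a></p></body></html>')
    print(parser.text)    # -> Hello world
    print(parser.links)   # -> ['/next']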
b) Hashing:
After each document is parsed, its words are encoded into numbers. A hash function is applied to attach a numerical value to each word, so every word is converted into a wordID. An inverted index maintains the relationship between wordIDs and docIDs, which makes it possible to quickly find the documents that contain a given word.
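A minimal sketch of the hashing step and the inverted index, assuming the input is already plain text. Python's hashlib stands in here for the hash function, and truncating the digest to 64 bits is just an illustrative choice:

# Maps words to numeric wordIDs and builds wordID -> set-of-docIDs.
import hashlib
from collections import defaultdict

def word_id(word):
    """Map a word to a fixed-size numeric wordID via a hash function."""
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")   # 64-bit wordID

def build_inverted_index(documents):
    """documents: dict of docID -> plain text. Returns wordID -> set of docIDs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word_id(word)].add(doc_id)
    return index

if __name__ == "__main__":
    docs = {1: "web crawlers gather pages", 2: "an index helps find pages quickly"}
    index = build_inverted_index(docs)
    print(index[word_id("pages")])   # -> {1, 2}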
Searching:
Not all documents matching the index are equally relevant; among millions of candidate documents, only the most relevant ones should be listed. In the simplest case, a search engine could just store each word and the URL where it was found. In reality this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether it was used once or many times, or whether the page contained links to other pages containing the word. So, to provide quality search results efficiently, the searching process has to complete the following steps (a small sketch follows the list):
· Parse the query.
· Convert words into wordIDs using hash function.
· Compute the rank of that document for the query.
· Sort the documents by rank.
· List only the top N documents.
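The sketch below walks through these steps on top of the wordID / inverted-index idea from the hashing section. The ranking is a plain term-frequency count, far simpler than the link-based ranking real engines use, and all names in it are illustrative:

# Parse the query, hash its words, look up candidate documents in the
# inverted index, rank them, sort by rank and keep the top N.
import hashlib
from collections import defaultdict

def word_id(word):
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def search(query, index, documents, top_n=10):
    """index: wordID -> set of docIDs; documents: docID -> plain text."""
    query_ids = set(word_id(w) for w in query.split())   # parse query, hash words

    # Gather candidate documents containing at least one query word.
    candidates = set()
    for wid in query_ids:
        candidates |= index.get(wid, set())

    # Rank each candidate: here, simply how often the query words occur in it.
    ranks = defaultdict(int)
    for doc_id in candidates:
        for w in documents[doc_id].split():
            if word_id(w) in query_ids:
                ranks[doc_id] += 1

    # Sort the documents by rank and list only the top N.
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)[:top_n]

if __name__ == "__main__":
    docs = {1: "web crawlers gather pages", 2: "an index helps find pages quickly"}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in text.split():
            index[word_id(w)].add(doc_id)
    print(search("index pages", index, docs))   # -> [(2, 2), (1, 1)]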
For those who are interested in implementing a web crawler, check out any of the open source crawlers listed below:
Heritrix is the Internet Archive's archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web (written in Java).
ht://Dig includes a web crawler in its indexing engine (written in C).
Larbin is a simple web crawler (written in C++).
Nutch is a scalable crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text indexing package.
WIRE - Web Information Retrieval Environment (Baeza-Yates and Castillo, 2002) is a web crawler written in C++ and released under the GPL. It includes several policies for scheduling page downloads and a module for generating reports and statistics on the downloaded pages, so it has been used for Web characterization.
Ruya is an open-source, breadth-first, level-based web crawler written in Python.
Universal Information Crawler is a simple web crawler written in Python.
DataparkSearch is a crawler and search engine released under the GNU General Public License.