
Tuesday, October 9, 2007

How Does a Web Crawler Crawl?

The internet holds hundreds of millions of pages covering an amazing variety of topics, so retrieving useful information from the web is a daunting task. How do we find what we need among those millions of pages? In practice we turn to internet search engines such as google.com, yahoo.com and live.com. These are special sites on the web designed to help people find information stored on other sites. At first glance it seems like magic: these sites appear to understand what we intend to search for. Search engines come in two kinds: crawler-based search engines and human-powered directories. A crawler-based search engine creates its listings automatically and tracks changes to web pages on its own, whereas a human-powered directory depends on humans for its listings. So, on a rapidly growing web, crawler-based search engines are the better fit.

Crawler-based search engines have three major steps.

a) Crawling

b) Indexing

c) Searching

Crawling:

Web crawlers (also called spiders) are programs that locate and gather information on the web. They recursively follow the hyperlinks in known documents to find other documents. The usual starting points are lists of heavily used servers and very popular pages. In this way, the spider quickly begins to travel, spreading out across the most widely used portions of the web. The spider revisits each site on a regular basis, such as every month or two, to look for changes.
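The link-following idea can be sketched in a few lines of Python. This is only a minimal sketch: it walks a hypothetical in-memory "web" (the `TOY_WEB` dictionary and its URLs are made up for illustration) instead of downloading real pages, and it uses a queue rather than recursion, which gives a breadth-first crawl.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: visit the seeds, then follow links not seen before."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, so no page is visited twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# In a real crawler, fetch_links would download the page and parse out its links.
order = crawl(["http://a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

A real crawler would also need politeness delays, robots.txt handling and a revisit schedule, none of which are shown here.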

Indexing:

An index helps to find information as quickly as possible. The index is also known as a catalog. If a web page changes, the index is updated with the new information. Indexing basically consists of two steps:

a) Parsing

b) Hashing

a. Parsing:

The parser extracts links for further crawling. It also removes tags, JavaScript, comments, etc. from the web pages and converts the HTML document to plain text. Regular expressions are used extensively for the automated analysis of the text. A parser designed to run on the entire web must handle a huge array of possible errors.
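As a rough illustration of the parsing step, here is a small sketch built on `html.parser` from Python's standard library rather than on regular expressions. The class name `PageParser`, the helper `parse_page` and the sample page are assumptions made for this example only.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects hyperlinks and visible text; skips script and style content."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; HTMLParser routes them to handle_comment.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html):
    """Return (links, plain_text) for one HTML document."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


sample = """<html><head><script>var x = 1;</script></head>
<body><!-- a comment --><p>Hello <a href="http://b.example/">world</a></p></body></html>"""
links, text = parse_page(sample)
print(links, text)
```

The links would go back to the crawler's queue, while the plain text is handed on to the hashing step.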

b. Hashing:

After each document is parsed, its content is encoded into numbers. A formula known as a hash function is applied to assign a numerical value to each word, so every word is converted into a wordID. An inverted index maintains the relationship between wordIDs and docIDs, which makes it quick to find the documents containing a given word.
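The wordID and inverted index idea might look like the sketch below. The choice of SHA-256 truncated to 32 bits is only an assumption for illustration (any stable hash function would do), and the function names and sample documents are hypothetical.

```python
import hashlib
from collections import defaultdict


def word_id(word):
    """Map a word to a stable 32-bit integer (SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_inverted_index(docs):
    """docs: {doc_id: plain text}. Returns {word_id: {doc_id: occurrence count}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.split():
            wid = word_id(word)
            index[wid][doc_id] = index[wid].get(doc_id, 0) + 1
    return index


docs = {1: "web crawler crawls the web", 2: "python crawler"}
index = build_inverted_index(docs)
print(sorted(index[word_id("crawler")]))  # docIDs of documents containing "crawler"
```

Python's built-in `hash()` is avoided here because its value for strings changes between interpreter runs, which would make stored wordIDs useless.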

Searching:

Not all documents that match the index are equally relevant. Among the millions of documents, only the most relevant ones should be listed. In the simplest case, a search engine could just store the word and the URL where it was found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether the word was used once or many times, or whether the page contained links to other pages containing the word. So, to provide quality search results efficiently, the search process has to complete the following steps:

· Parse the query.

· Convert words into wordIDs using hash function.

· Look up the matching documents in the index and compute the rank of each one for the query.

· Sort the documents by rank.

· List only the top N documents.
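The steps above can be sketched end to end as follows. The ranking used here, a simple sum of word occurrence counts, is an assumption chosen for brevity and not how any real search engine ranks pages; all names and sample documents are hypothetical.

```python
import hashlib
from collections import defaultdict


def word_id(word):
    """Stable 32-bit wordID (SHA-256 is an illustrative choice of hash)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_index(docs):
    """Inverted index: {word_id: {doc_id: occurrence count}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.split():
            wid = word_id(word)
            index[wid][doc_id] = index[wid].get(doc_id, 0) + 1
    return index


def search(query, index, top_n=10):
    words = query.split()                         # parse the query
    ids = [word_id(w) for w in words]             # convert words into wordIDs
    scores = defaultdict(int)
    for wid in ids:                               # rank = summed occurrence counts
        for doc_id, count in index.get(wid, {}).items():
            scores[doc_id] += count
    # Sort by rank, highest first; ties are broken by the lower docID.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_n]]  # keep only the top N


docs = {
    1: "web crawler crawls the web",
    2: "python web framework",
    3: "python crawler written in python",
}
index = build_index(docs)
results = search("web crawler", index, top_n=2)
print(results)
```

Counting occurrences already addresses the "once or many times" problem mentioned earlier; weighting by where the word appears or by links between pages would need extra data in the index.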

For those interested in how a web crawler is implemented, check out any of the open source crawlers listed below:

Heritrix is the Internet Archive's archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web (written in Java).

ht://Dig includes a web crawler in its indexing engine (written in C).

Larbin is a simple web crawler (written in C).

Nutch is a scalable crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text indexing package.

WIRE - Web Information Retrieval Environment (Baeza-Yates and Castillo, 2002) is a web crawler written in C++ and released under the GPL. It includes several policies for scheduling page downloads and a module for generating reports and statistics on the downloaded pages, so it has been used for web characterization.

Ruya is an open source, breadth-first, level-based web crawler written in Python.

Universal Information Crawler is a simple web crawler written in Python.

DataparkSearch is a crawler and search engine released under the GNU General Public License.

Python in Brief

Python:

Python is a general-purpose, object-oriented, high-level interpreted language. It was originally developed in the early '90s by Guido van Rossum, whose goal was a language that stresses readability, simplicity and elegance. Python runs on all major hardware platforms and operating systems, so it doesn't constrain your platform choices.

Python offers high productivity for all phases of the software life cycle: analysis, design, prototyping, coding, testing, debugging, tuning, documentation, deployment, and, of course, maintenance. Python is easy to learn, so it is quite suitable for anyone new to programming, yet at the same time it is powerful enough for the sophisticated expert. Many sophisticated libraries are available, which make Python even better equipped. The combination of simplicity, power and portability, along with its open source nature, has made Python extremely popular.

Who uses Python?

Python is used extensively for system administration tasks (it is, for example, a vital component of several Linux distributions). It is also used to teach programming to beginners. Here is a list of some organizations that use Python:

  • Google has used it to implement many components of its web crawler and search engine. Most interestingly, even the originator of Python, Guido van Rossum, is a Google employee.
  • NASA uses Python for several of its software systems, and has adopted it as the standard scripting language for its Integrated Planning System.
  • Industrial Light & Magic, the visual effects company behind Star Wars, uses Python in its production of special effects for large-budget feature films.
  • Yahoo! uses it (among other things) to manage its discussion groups and Yahoo! Maps.
  • The video sharing site YouTube uses it.
  • Disney uses Python for its animation production applications. It has developed a 3D engine, "Panda3D", for the development of interactive graphics. Panda3D is a joint effort of Disney and Carnegie Mellon University (CMU).

To learn more about organizations using Python, visit http://wiki.python.org/moin/OrganizationsUsingPython

Advantages of Python

  • Python is open source software, so it has a huge open source community supporting it.
  • Python is available on an incredibly wide range of hardware and software platforms. This includes the usual suspects: Sun, Intel, IBM, Microsoft Windows variants, Macintosh OS variants and all *nix systems.
  • Python programs take less time to develop than programs in other high-level languages. Because of the elegance and simplicity of the language, Python programs tend to be 3-5 times shorter than their Java equivalents, and 5-10 times shorter than their C++ equivalents.
  • Since Python code is highly readable, programs are easier to maintain. This reduces maintenance cost, which is crucial in software development.
  • Python is a dynamically typed language.
  • It exploits the full power of the object-oriented approach.
  • Python programs can be extended using C, C++, or Java. SWIG (Simplified Wrapper and Interface Generator) helps to create wrappers for Python.
  • Popular web development frameworks for Python, such as Zope, Plone, Django and TurboGears, make it popular for web development.

What do Python users say?

Google

"Python has been an important part of Google since the beginning, and remains so as the system grows and evolves. Today dozens of Google engineers use Python, and we're looking for more people with skills in this language." said Peter Norvig, director of search quality at Google, Inc.

YouTube.com

"Python is fast enough for our site and allows us to produce maintainable features in record times, with a minimum of developers," said Cuong Do, Software Architect, YouTube.com.

Industrial Light & Magic

"Python plays a key role in our production pipeline. Without it a project the size of Star Wars: Episode II would have been very difficult to pull off. From crowd rendering to batch processing to compositing, Python binds all things together," said Tommy Burnette, Senior Technical Director, Industrial Light & Magic.

"Python is everywhere at ILM. It's used to extend the capabilities of our applications, as well as providing the glue between them. Every CG image we create has involved Python somewhere in the process," said Philip Peterson, Principal Engineer, Research & Development, Industrial Light & Magic.

Visit http://www.python.org/about/quotes/ to read more quotes from others.

Is Python Suitable for Me?

Well, it depends on what you are looking for. If you are completely new to programming, Python is easy to learn and has a gentle learning curve.

For rapid development, Python is one of the best languages. However, for high-end computational and simulation software that involves extremely complex graphics and mathematics, C and C++ may be the better choice. I think the quotations from web and software giants like Google, YouTube and Industrial Light & Magic speak louder than thousands of my words. So, you can turn to Python if you want a pleasing coding experience. And it's true that Python never bites.

NOTE: If you are sure that it's the programming language for you, then get a recent copy of Python from www.python.org. The Eclipse IDE with the PyDev plug-in will give you a great coding experience.

PHP: Object-Oriented Programming

Object orientation means thinking about the parts of a program as entities, in the same way we think about day-to-day objects. Object-oriented programming is widespread today, and many universities teach it in beginning programming classes. Currently, Java and C++ are the most prevalent languages used for object-oriented programming. Object-oriented programming is not just a matter of using different syntax. It's a different way of analyzing programming problems. In object-oriented programming, the elements of a program are objects. The objects represent the elements of the problem your program is meant to solve. Object-oriented programming developed new concepts and new terminology to represent those concepts.

But what about OOP in web scripting languages like PHP? Web scripts typically execute quickly and then go away, so you may ask whether we really need OOP in web scripting at all. PHP wasn't developed as an object-oriented language; it began life as a simple set of scripts. However, PHP couldn't remain untouched by the growing global phenomenon of OOP, and its numerous advantages forced the language to reckon with it. Over the course of its life, PHP has gained more and more object-oriented features. First, you could define classes, but there were no constructors. Then constructors appeared, but there were no destructors. Slowly but surely, as more people began to push the limits of PHP's syntax, additional features were added to satisfy the demand.

Object-oriented programming became possible with PHP 4. With the introduction of PHP 5, the PHP developers have really beefed up the object-oriented features of PHP, resulting in both more speed and added features. Much of this improvement is invisible: changes introduced with Zend Engine 2, which powers PHP 5, make scripts that use objects run much faster and more efficiently than they did in PHP 4. PHP still has a less thoroughgoing implementation of OOP than languages like C++ and Java. Some concepts are still missing, such as function overloading and multiple inheritance.

OOP has really changed the way we program in PHP. Because of OOP, code redundancy has been greatly reduced, and with this technique we are able to put together simple programs on the fly. It has also made it easier to develop and maintain large-scale projects. Features like inheritance, encapsulation and abstraction have helped speed up the development of products built with PHP. As of PHP 5, the language supports single inheritance, constructors and destructors, encapsulation, static functions, object interfaces, etc. So, OOP is certainly becoming an integral part of PHP programming.

Saturday, October 6, 2007

Open Source and Free Software: Does It Really Matter?

You can get Ubuntu Linux CDs delivered for free to your home (or any place you like), so would you rather buy a copy of Windows Vista worth hundreds of USD?
You can find open source software of every kind, be it an operating system or a web browser, a media player or a programming language. Almost every software need can be met for free, with the user getting both the software and its source code. Isn't that alluring? It is good for users and a jackpot for novice programmers on the verge of turning pro.
I can see many of my friends moving towards Linux these days; it's becoming trendy. Looking further ahead, I can anticipate the future of the software community: one doesn't need to buy software anymore, just connect via the internet to the international GNU community and get your program downloaded or delivered to your doorstep.
In my circle of friends, nobody pays a buck for software; they use either open source software or pirated copies. Maybe that is because users in my country (Nepal) aren't well off enough to pay for costly software priced in USD.
GNU is making a huge wave around the world. It's something like socialism: everything for free. If one gets the program, one must get the source code too; that's one's right. The program shouldn't even be sold commercially; it should be meant to serve human e-civilization. Just look at the contribution of Wikipedia and the popularity it has gained. When something is free and useful, everyone loves to use it.
"Free software" is a matter of liberty, not price. To understand the concept, you should think of "free" as in "free speech," not as in "free beer."
Richard Matthew Stallman could be the next Karl Marx, one who foresaw the future of e-civilization. :-)

I. The freedom to run the program, for any purpose (freedom 0).
II. The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.
III. The freedom to redistribute copies so you can help your neighbor (freedom 2).
IV. The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.

So do free software and open source matter? What do you think?
--Bishwa Hang Rai

Friday, October 5, 2007

Nepal CA Polls Postponed. Fragile Peace Process!!!

After week-long talks between the seven parties and the CPN (Maoist), no conclusion could be reached. The PM has summoned a special parliament session on Oct 11 to discuss the two main demands of the Maoists: first, the announcement of the Republic of Nepal through the Interim Parliament; second, "Samanupatik Nirwachan Pranali" (a proportional electoral system). Well, I don't think the NC is in a hurry to declare the Republic of Nepal before the CA election. On the other hand, the Maoists are making every effort to get their demands heard and fulfilled. For the Maoists, it's like fighting another People's War.
The CPN (UML), another big and powerful party in the government, has announced protest programmes against the postponement of the CA polls. As a member of the public, I can say they were the only ones who were first in the race for the election. Good for the people.
The UML had earlier proposed a general referendum on declaring the Republic of Nepal and on "Samanupatik Nirwachan Pranali", but the Maoists didn't set their eyes on it, as they were happy with PM GP Koirala giving them 73 seats in the Interim Parliament, which made them (virtually) as big a party as the UML. Now, though, they are putting their best effort, their entire political power and their future political life into meeting those two demands, which the UML had proposed earlier.
The Congress wasn't united and its vote banks in the Terai were in horrible condition, so it was slack in the CA election process. The Maoists, meanwhile, were afraid because their own polling showed that, except for Mahara, no other candidate was strong enough to win. God bless the YCL and their reputation among the public, hehe.
The UML was the dark horse and was on the front line, a bloody good chance for them.
Frankly, an election environment had already been created; we could hear people talking about the CA election everywhere. I was anxiously waiting for the list of candidates on the 13th of Asoj. And nothing happened. The election was postponed for the second time.
The same thing has happened before in the history of Nepal. After the arrival of democracy in 2007 B.S., the whole nation was meant to go to CA polls, but they were postponed again and again and in the end never held. A general election was held in 2015 B.S., followed by Mahendra's total seizure of power in 2017 B.S. Let's hope history does not repeat itself.
The whole international community was watching us strive towards permanent peace and stability, but now that the CA election has been postponed, they too are expressing strong disapproval of the postponement.

Let's hope for a fruitful outcome from the Oct 11 special House session. Let the CA election be held, if possible at the pre-scheduled time, Nov 22. Let the leaders set aside their parties' interests and selfishness for the stable peace of the nation.

Let my country wake, let the youth wake...

--Bishwa Hang Rai