The Anatomy Of An Automated Search Engine!
Monday, June 23rd, 2008
The creation of a large-scale search engine is an onerous task and one that entails huge challenges.
An ideal automated search engine today is one that crawls the web quickly, gathers all the documents and keeps its copies of them up to date. Plenty of storage space is required to store the indices, or the documents themselves, efficiently.
The ever-growing internet means an enormous volume of data to handle, along with billions of queries daily. The indexing system of a search engine should be capable of processing huge amounts of data while using space as efficiently as possible, and of handling thousands of queries per second. Users should get the best possible navigation experience: they should be able to find almost anything on the Web, with junk results filtered out by high-precision tools, since junk results are the main problem users face.
The anatomy of a search engine includes major components such as crawling the web, indexing and searching.
Web Crawling
Today's search engines depend on spiders or robots: special software designed to search the web continuously for new pages.
Web crawling is the most important aspect of a search engine and also the most challenging. It involves interaction with thousands of web servers and name servers, and it is performed by many fast, distributed crawlers. A URL server keeps supplying the crawlers with lists of URLs to crawl and store. The crawlers start with the most heavily used servers and the most popular pages. Each crawler keeps hundreds of connections open at one time in order to retrieve web pages quickly. For each page, the crawler has to look up the DNS, connect to the host, send a request and receive a response. It does not rank the web pages; it retrieves copies of them, compresses them and stores them in a repository so that they can later be indexed and ranked on different criteria. Everything from visible text, images and alt tags to other non-HTML content, word processor documents and more is indexed.
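The four steps a crawler goes through for each page, and the compressed repository it stores pages in, can be sketched roughly as follows. This is only an illustration under stated assumptions, not how any particular search engine does it: the names fetch and Repository are hypothetical, the fetch handles plain HTTP only (no HTTPS, redirects or robots.txt), zlib is an assumed choice of compression, and the repository is just an in-memory dictionary.

```python
import socket
import zlib
from urllib.parse import urlsplit


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch one page over plain HTTP, spelling out the four steps (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    # 1. Look up the DNS name.
    ip_address = socket.gethostbyname(host)

    # 2. Connect to the host.
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        # 3. Send a request.
        request = (
            f"GET {path} HTTP/1.0\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        conn.sendall(request.encode("ascii"))

        # 4. Receive the response until the server closes the connection.
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


class Repository:
    """In-memory stand-in for the page repository (assumption: zlib compression)."""

    def __init__(self) -> None:
        self._pages = {}

    def store(self, url: str, raw: bytes) -> None:
        # Pages are compressed before storage; no ranking happens here.
        self._pages[url] = zlib.compress(raw)

    def load(self, url: str) -> bytes:
        return zlib.decompress(self._pages[url])

    def stored_size(self, url: str) -> int:
        return len(self._pages[url])
```

A crawler built on this sketch would call repo.store(url, fetch(url)) for each URL handed out by the URL server, and a real one would run hundreds of such fetches concurrently rather than one at a time.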
Crawlers usually visit the same web pages repeatedly to check that the site is stable and that its pages are being updated frequently. If a certain web page is not functioning at some point, the crawlers are usually programmed to go back later and try again. However, if the page turns out to be down continuously or is not updated frequently, they stay away for longer periods of time or index it more slowly.
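One simple way to model this revisit behaviour is an interval that shrinks when a page has changed and grows when the page is down or unchanged. The sketch below is an assumption about how such a policy could look, not a description of any real crawler: PageState, record_visit, the doubling and halving rule and the interval limits are all made up for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Assumed limits, chosen only for illustration.
MIN_INTERVAL = 3600.0              # one hour, in seconds
MAX_INTERVAL = 30 * 24 * 3600.0    # thirty days, in seconds


@dataclass
class PageState:
    interval: float = 24 * 3600.0          # current gap between visits
    last_fingerprint: Optional[str] = None
    next_visit: float = 0.0                # timestamp of the next planned visit


def fingerprint(content: bytes) -> str:
    """Hash of the page body, used to detect whether the page changed."""
    return hashlib.sha256(content).hexdigest()


def record_visit(state: PageState, now: float, new_fingerprint: Optional[str]) -> PageState:
    """Update the schedule after a visit; new_fingerprint is None if the page was down."""
    if new_fingerprint is None or new_fingerprint == state.last_fingerprint:
        # Page was down or has not changed: stay away for longer.
        state.interval = min(state.interval * 2, MAX_INTERVAL)
    else:
        # Page changed: come back sooner and remember the new content hash.
        state.interval = max(state.interval / 2, MIN_INTERVAL)
        state.last_fingerprint = new_fingerprint
    state.next_visit = now + state.interval
    return state
```

With this rule, a page that is down once is simply retried later, while a page that keeps failing or never changes drifts towards the maximum interval.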
Crawlers can also follow all the links found on the web pages they visit, either as and when they find them or at a later time.
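Link following can be sketched as two small pieces: an extractor that pulls links out of a fetched page, and a queue of URLs still to be visited. The class names LinkExtractor and Frontier are hypothetical, and the choices to drop URL fragments, keep only http and https links and visit in first-in, first-out order are assumptions made for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from the <a href="..."> tags of one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


class Frontier:
    """Queue of URLs to visit later; each URL is queued at most once."""

    def __init__(self) -> None:
        self._queue = deque()
        self._seen = set()

    def add(self, url: str) -> None:
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next(self) -> str:
        return self._queue.popleft()

    def __len__(self) -> int:
        return len(self._queue)
```

A crawler that follows links as it finds them would call next() straight after adding them, while one that defers them just leaves them queued for a later pass.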


