Building a large-scale search engine is an onerous task that involves huge challenges.
An ideal automated search engine crawls the Web quickly and regularly, gathering all documents and keeping its copies up to date. It also needs plenty of storage space to hold its indices, or the documents themselves, efficiently.
On the ever-growing Internet, a search engine has to handle an enormous volume of data along with billions of queries daily. Its indexing system should be capable of processing huge amounts of data while using its space as efficiently as possible, and of handling thousands of queries per second. Users should get the best possible navigation experience: they should be able to find almost anything on the Web, with junk results excluded by high-precision tools.
The anatomy of a search engine includes major applications such as those that allow for crawling the Web, indexing, and searching.
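To make the indexing and searching parts concrete, here is a minimal sketch of an inverted index, a common structure that maps each word to the documents containing it. The text does not say which structure any particular engine uses, so this is an assumption, and the names tokenize, build_index and search are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every term to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

A real engine would also compress the postings and rank the matches. This sketch only shows why a lookup does not need to scan every stored document.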
Search engines today depend on spiders or robots (special software) designed to continuously search the Web to find new pages.
Web crawling is the most important aspect of a search engine and also the most challenging. It involves interaction with thousands of Web servers and name servers, and it is performed by many fast, distributed crawlers. A URL server keeps supplying the crawlers with lists of URLs to crawl and store. The crawlers start with the most used servers and highly popular pages. Each crawler keeps hundreds of connections open at one time in order to retrieve Web pages quickly. For every page, the crawler has to look up the host in DNS, connect to it, send a request and receive a response. It does not rank the Web pages; it retrieves copies of all of them, compresses them and stores them in a repository. They’re later indexed and ranked based on different criteria. Everything is indexed: the visible text, images, alt tags, other non-HTML content, word processor documents, and more.
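The flow described above (a URL server hands out URLs, a crawler fetches each page, and a compressed copy lands in a repository) can be sketched roughly as follows. This is a simplified, single-threaded illustration, not how any real engine is implemented. UrlServer, crawl, fetch and repository are hypothetical names, and the fetch function is passed in so that the DNS lookup, connection, request and response stay behind one call.

```python
import zlib
from collections import deque
from typing import Callable, Dict, Iterable, Optional


class UrlServer:
    """Hypothetical stand-in for the URL server: hands out URLs to crawl."""

    def __init__(self, seeds: Iterable[str]) -> None:
        self._queue = deque(seeds)
        self._seen = set(self._queue)

    def add(self, url: str) -> None:
        """Queue a URL unless it has been queued before."""
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self) -> Optional[str]:
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


def crawl(
    url_server: UrlServer,
    fetch: Callable[[str], bytes],
    repository: Dict[str, bytes],
) -> None:
    """Fetch every queued URL and store a compressed copy. No ranking happens here."""
    while (url := url_server.next_url()) is not None:
        try:
            body = fetch(url)  # DNS lookup, connect, request, response
        except OSError:
            continue  # unreachable for now; a real crawler would retry later
        repository[url] = zlib.compress(body)
```

In practice, fetch could wrap something like urllib.request.urlopen(url).read(), whose network errors are subclasses of OSError. A production crawler would run many such loops concurrently to keep hundreds of connections open.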
Crawlers usually visit the same Web pages repeatedly to check that the site is stable and that its pages are being updated frequently. If a certain Web page is not functioning at some point, the crawlers are usually programmed to go back later and try again. However, if a page turns out to be continuously down or rarely updated, they stay away for longer periods of time or index it more slowly.
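One way to express this revisit behaviour is an adaptive interval per page: shorten it when the content changed, lengthen it when the page is unchanged or repeatedly down. The sketch below is an assumption about how such a policy might look, not a description of any specific crawler. PageState, next_interval and the interval limits are all hypothetical.

```python
from dataclasses import dataclass

MIN_INTERVAL = 3600.0        # one hour (assumed lower bound)
MAX_INTERVAL = 30 * 86400.0  # thirty days (assumed upper bound)


@dataclass
class PageState:
    interval: float = 86400.0  # current revisit interval in seconds
    failures: int = 0          # consecutive failed fetches
    last_hash: str = ""        # fingerprint of the last fetched content


def next_interval(state: PageState, fetched_ok: bool, content_hash: str = "") -> float:
    """Update the page state and return the seconds to wait before the next visit."""
    if not fetched_ok:
        state.failures += 1
        if state.failures == 1:
            return MIN_INTERVAL  # a single failure: come back soon and try again
        # Continuously down: stay away for longer and longer.
        state.interval = min(state.interval * 2, MAX_INTERVAL)
        return state.interval
    state.failures = 0
    if content_hash != state.last_hash:
        # The page changed: visit more often.
        state.interval = max(state.interval / 2, MIN_INTERVAL)
    else:
        # Nothing new: visit less often.
        state.interval = min(state.interval * 2, MAX_INTERVAL)
    state.last_hash = content_hash
    return state.interval
```

Doubling and halving are arbitrary choices made for this sketch. Any rule that lengthens the wait for dead or stale pages would capture the same idea.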
Crawlers also have the capability of following all the links found on Web pages, which they can then visit either right away or later.
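Link following starts with pulling the links out of a fetched page. A minimal sketch using only the Python standard library might look like the following. LinkExtractor and extract_links are hypothetical names, and the choices to keep only http(s) links and to drop fragments are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web_link = absolute.startswith(("http://", "https://"))
                if is_web_link and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url: str, html: str) -> List[str]:
    """Return the unique links found in html, in order of first appearance."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The returned URLs could then be visited immediately or queued for a later visit, as described above.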