The Anatomy of an Automated Search Engine

Creating a large-scale search engine is an onerous task that involves huge challenges.

An ideal automated search engine crawls the Web quickly and regularly so that the documents it gathers stay up-to-date. It also requires plenty of storage space to hold the indices, and often the documents themselves, efficiently.

The ever-growing Internet involves handling enormous amounts of data and billions of queries a day. The indexing system of a search engine must be able to process this data while using its storage space as efficiently as possible, and the query system must handle thousands of queries per second. Users should be given the best possible navigation experience: the ability to find almost anything on the Web, with junk results excluded, through high-precision tools.

The anatomy of a search engine comprises major applications for crawling the Web, indexing, and searching.

Web Crawling

Search engines today depend on spiders, or robots (specialized software), that continuously traverse the Web to find new pages.

Web crawling is the most important aspect of a search engine and also the most challenging. It involves interaction with thousands of Web servers and name servers, and it is performed by many fast, distributed crawlers. A URL server continually feeds the crawlers lists of URLs to fetch and store, and the crawlers typically begin with the most heavily used servers and the most popular pages. Each crawler keeps hundreds of connections open at a time in order to retrieve Web pages quickly; for each page it must look up the DNS entry, connect to the host, send a request, and receive a response. The crawler does not rank the Web pages it visits. Instead, it retrieves copies of all of them, compresses them, and stores them in a repository, where they are later indexed and ranked according to various criteria. Everything is indexed: the visible text, images, alt tags, other non-HTML content, word processor documents, and more.
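
The snippet below is a minimal, single-threaded sketch of the fetch-and-store step described above. Real crawlers are distributed and keep hundreds of connections open at once; the repository structure and helper names here are illustrative assumptions, not a description of any particular engine.

```python
# Fetch one page, compress it, and file it in a toy in-memory repository.
import urllib.request
import zlib

def fetch_and_store(url, repository):
    """Download one page, compress it, and store it under its URL."""
    with urllib.request.urlopen(url, timeout=10) as response:
        html = response.read()                 # raw bytes of the page
    repository[url] = zlib.compress(html)      # store a compressed copy
    return len(html)

repository = {}                                # stand-in for the on-disk repository
fetch_and_store("https://example.com/", repository)
```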

Crawlers usually revisit the same Web pages repeatedly to check that the site is stable and that its pages are being updated. If a page is not responding at some point, the crawlers are usually programmed to come back later and try again. If a page turns out to be down continuously, or is rarely updated, they return less often and index it more slowly.
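
A toy revisit policy along these lines is sketched below: pages that fail to respond or rarely change have their next-visit interval lengthened, while frequently changing pages are revisited sooner. The interval bounds and doubling rule are arbitrary assumptions for illustration.

```python
def next_visit_delay(current_delay, fetched_ok, content_changed,
                     min_delay=3600, max_delay=30 * 86400):
    """Return the number of seconds to wait before crawling this page again."""
    if not fetched_ok or not content_changed:
        return min(current_delay * 2, max_delay)   # back off on dead or stale pages
    return max(current_delay // 2, min_delay)      # revisit changing pages sooner
```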

Crawlers also follow the links they find on Web pages, visiting the linked pages either right away or later.
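
The sketch below shows one way a crawler might pull the links out of a fetched page so they can be queued for a later visit. It uses only the Python standard library; the example URLs are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags, resolved against the page's URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/index.html")
extractor.feed('<a href="/about">About</a> <a href="https://other.org/">Other</a>')
print(extractor.links)   # ['https://example.com/about', 'https://other.org/']
```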

Web Indexing

Every page found by the search engine spider is sent for indexing, which is what makes it fast to find relevant documents for a search query. Indexing is performed by the indexer (sometimes called the catalog) together with the sorter. The index is like a huge book containing a copy of every Web page the spider finds; when a Web page changes, the book is updated automatically.

The indexer performs a variety of jobs: it reads the repository, decompresses the documents, and parses them. Its overall purpose is to make information as easy and as fast to find as possible.
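
A bare-bones version of this first indexing pass might look like the sketch below: read a compressed page out of the repository, decompress it, and break the text into word occurrences. The crude tag stripping, tokenizer, and repository layout are simplifying assumptions.

```python
import re
import zlib

def parse_document(doc_id, repository):
    """Return (position, word) pairs for one compressed document in the repository."""
    html = zlib.decompress(repository[doc_id]).decode("utf-8", errors="replace")
    text = re.sub(r"<[^>]+>", " ", html)            # crude removal of HTML tags
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    return list(enumerate(words))                   # word positions within the document
```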

The repository where the crawler stores the Web page data contains the complete HTML of every page. All documents are kept there, and every Web page is given a unique document identifier, or docID, which is assigned whenever a new URL is collected from a Web page.
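
The following sketch illustrates docID assignment as described above: each newly seen URL receives the next integer identifier, and a URL that has already been seen keeps the one it has. The class name is purely illustrative.

```python
class DocIdTable:
    """Map URLs to stable integer docIDs, assigning new IDs on first sight."""
    def __init__(self):
        self._ids = {}

    def doc_id(self, url):
        if url not in self._ids:
            self._ids[url] = len(self._ids)        # assign the next free docID
        return self._ids[url]

table = DocIdTable()
print(table.doc_id("https://example.com/"))        # 0
print(table.doc_id("https://example.com/about"))   # 1
print(table.doc_id("https://example.com/"))        # 0 again, already known
```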

The indexer extracts the information from each document and stores it in a database. High-quality search engines index every word in a document and give each one a unique word ID. The word occurrences, which some search engines call “hits,” are then recorded, including each word’s position in the document, its font size, and its capitalization.
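
The sketch below turns parsed word occurrences into hits keyed by word ID. The capitalization and font-size fields stand in for the kind of per-occurrence attributes described above; the lexicon and record layout are assumptions made for the example.

```python
from collections import defaultdict

lexicon = {}                                       # word -> wordID

def word_id(word):
    if word not in lexicon:
        lexicon[word] = len(lexicon)
    return lexicon[word]

def build_hits(doc_id, occurrences):
    """occurrences: iterable of (position, word, is_capitalized, font_size) tuples."""
    hits = defaultdict(list)                       # wordID -> hits within this document
    for position, word, is_cap, font_size in occurrences:
        hits[word_id(word.lower())].append((doc_id, position, is_cap, font_size))
    return hits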

Parsing

The indexer also parses out all the links on each Web page and stores their information in a separate file, recording where each link comes from and points to, as well as the text of the link. Once parsing is done, the indexer distributes these hits and performs an initial sorting, creating a partially sorted forward index.

The file containing the links is then read, the relative URLs in it are converted into absolute URLs, and those URLs are turned into docIDs. The anchor text is placed into the forward index, associated with the docID the anchor points to. A database of links, consisting of pairs of docIDs, is then created; this database is used to compute the page ranks of all the documents.
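
A rough sketch of this pass is shown below: relative URLs are resolved against the page they appear on, both ends of each link are mapped to docIDs, and the result is a list of (from_docID, to_docID, anchor_text) records that later feeds the page rank computation. The DocIdTable-like lookup object is the hypothetical helper from the earlier docID sketch.

```python
from urllib.parse import urljoin

def resolve_links(from_url, raw_links, doc_ids):
    """raw_links: iterable of (href, anchor_text); doc_ids: a DocIdTable-like object."""
    link_records = []
    for href, anchor_text in raw_links:
        absolute = urljoin(from_url, href)         # relative URL -> absolute URL
        link_records.append((doc_ids.doc_id(from_url),
                             doc_ids.doc_id(absolute),
                             anchor_text))
    return link_records
```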

Sorting

The job of the sorter is to take the forward indexes, which are sorted by docID, and sort them again by word ID to generate the inverted index. The process is done one index at a time, so it requires little extra storage, and multiple sorters can run in parallel. Because these indexes do not fit in main memory, the sorter subdivides them into smaller groups based on word ID and docID, then loads each group into memory, sorts it, and writes out the result.
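
Below is a toy version of the sorting step: forward-index entries of the form (docID, wordID, hit) are split into buckets by word ID so that each bucket can fit in memory, then each bucket is sorted and appended to the inverted index. The bucket size and record layout are assumptions chosen for the example.

```python
from collections import defaultdict

def build_inverted_index(forward_entries, bucket_size=1000):
    """forward_entries: iterable of (doc_id, word_id, hit) records."""
    buckets = defaultdict(list)                    # bucket number -> entries
    for doc_id, w_id, hit in forward_entries:
        buckets[w_id // bucket_size].append((w_id, doc_id, hit))

    inverted = defaultdict(list)                   # wordID -> postings list
    for bucket in sorted(buckets):                 # process one bucket at a time
        for w_id, doc_id, hit in sorted(buckets[bucket], key=lambda e: (e[0], e[1])):
            inverted[w_id].append((doc_id, hit))
    return inverted
```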

Page Rank

Page rank can be thought of as a measure of a Web page’s value that models user behavior. A search engine usually determines it from the link structure of the Web, taking into account the number of pages pointing to a page and the page rank of those pointing pages, and sometimes signals such as the number of visits to a page or group of pages. Page rank also has extensions in which links are weighted differently.
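
A minimal power-iteration sketch of the classic page rank computation over docID link pairs is shown below. The damping factor of 0.85 and the fixed iteration count are conventional choices for illustration, not requirements.

```python
def page_rank(links, num_docs, damping=0.85, iterations=20):
    """links: iterable of (from_docID, to_docID) pairs; returns a rank per docID."""
    out_degree = [0] * num_docs
    incoming = [[] for _ in range(num_docs)]
    for src, dst in links:
        out_degree[src] += 1
        incoming[dst].append(src)

    rank = [1.0 / num_docs] * num_docs
    for _ in range(iterations):
        rank = [(1 - damping) / num_docs +
                damping * sum(rank[src] / out_degree[src] for src in incoming[d])
                for d in range(num_docs)]
    return rank

print(page_rank([(0, 1), (1, 2), (2, 0), (0, 2)], num_docs=3))
```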

Searching

The final step is searching: providing the best-quality results for each query. The query is first parsed and the words in it are converted to word IDs. The doc list for every word is then retrieved, and the lists are scanned until documents containing all the query words are found. The page rank of those documents is computed, the matching documents are sorted by page rank, and the results are returned to the user.
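
A highly simplified version of this query processing is sketched below: look up each query word’s word ID, intersect the posting lists, and order the surviving documents by a precomputed rank. The lexicon, inverted index, and rank table are assumed to come from the earlier stages and are placeholders here.

```python
def search(query, lexicon, inverted_index, ranks):
    """Return docIDs containing every query word, best-ranked first."""
    word_ids = [lexicon[w] for w in query.lower().split() if w in lexicon]
    if not word_ids:
        return []

    # Documents that contain every query word.
    matching = set(doc for doc, _ in inverted_index[word_ids[0]])
    for w_id in word_ids[1:]:
        matching &= set(doc for doc, _ in inverted_index[w_id])

    # Sort the matches by their page rank, highest first.
    return sorted(matching, key=lambda doc: ranks.get(doc, 0.0), reverse=True)
```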

The major issue users face with some search engines is the poor quality of the results, which wastes a great deal of time; it is frustrating when they cannot find the information they are looking for.

A high-quality search engine returns relevant results. Beyond result quality, it must be designed to scale with the growing size of the Web by using storage efficiently, so that the enormous number of documents on the Web can be crawled, indexed, and searched at low cost.