How Do Search Engines Work - Web Crawlers

Submitted on 2007-04-18 14:24:03

You already use search engines, so you know how they work from the user's perspective. From your own experience, you also know that only the results listed near the top are likely to attract you. It doesn't help you to know that your search yielded 44,316 results; even number 50 on the list is unlikely to get your custom or even your attention. Getting listed at the top, or as near to the top as possible, is therefore crucial. Since most search engine traffic is free, it is usually worth your time to learn a few tricks that maximize the results from your time and effort. In the next section, you will see how a search engine works from your perspective as a website owner. It is the search engines that finally bring your website to the notice of prospective customers, so it pays to know how they actually work and how they present information to the customer initiating a search. There are basically two types of search engines. The first type is powered by robots called crawlers or spiders.

Search engines use spiders to index websites. When you submit your website pages to a search engine by completing its submission page, the search engine's spider will index your site. A 'spider' is an automated program run by the search engine system. The spider visits a web site, reads the content on the actual site and the site's Meta tags, and follows the links that the site connects to. It then returns all that information to a central repository, where the data is indexed. It will visit each link you have on your website and index those sites as well. Some spiders will only index a certain number of pages on your site, so don't create a site with 500 pages!
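What a spider reads from a single page can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name PageReader, the helper read_page and the sample page are all made up for this example, and a real spider would also fetch pages over the network and handle far messier HTML.

```python
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Collects the text, Meta tags, and outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.text = []    # words found in the page (title and body)
        self.meta = {}    # Meta tag name -> content
        self.links = []   # href targets the spider would follow next

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_data(self, data):
        self.text.extend(data.split())


def read_page(html):
    """Return what the spider would send back to the central repository."""
    reader = PageReader()
    reader.feed(html)
    return {"text": reader.text, "meta": reader.meta, "links": reader.links}


# Hypothetical page used only for this illustration.
SAMPLE_PAGE = """
<html><head>
<title>Blue Widgets</title>
<meta name="description" content="Hand-made blue widgets">
<meta name="keywords" content="widgets, blue">
</head><body>
<h1>Blue widgets for sale</h1>
<a href="/prices.html">Prices</a>
<a href="/contact.html">Contact</a>
</body></html>
"""

page = read_page(SAMPLE_PAGE)
print(page["meta"])   # the Description and Keywords Meta tags
print(page["links"])  # the links the spider would visit next
```

The two printed values correspond to the Meta tags and the links mentioned above; the words in page["text"] are what would later be indexed.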

The spider will periodically return to the sites to check for any information that has changed. The frequency with which this happens is determined by the moderators of the search engine.

The index a spider builds is almost like a book: it contains the table of contents, the actual content, and the links and references for all the websites the spider finds during its search. A spider may index up to a million pages a day. Examples of crawler-based search engines: Excite, Lycos, AltaVista and Google.

When you ask a search engine to locate information, it is actually searching through the index which it has created and not actually searching the Web. Different search engines produce different rankings because not every search engine uses the same algorithm to search through the indices.
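The idea of searching an index rather than the Web itself can be illustrated with a tiny inverted index. This is a minimal sketch, assuming three made-up pages (PAGES) and hypothetical helper names build_index and search; real engines store far more than a set of page names per word.

```python
from collections import defaultdict

# Hypothetical pages the spider has already brought back.
PAGES = {
    "a.html": "cheap blue widgets",
    "b.html": "red widgets and gadgets",
    "c.html": "gadget repair service",
}


def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


INDEX = build_index(PAGES)
print(sorted(search(INDEX, "widgets")))       # ['a.html', 'b.html']
print(sorted(search(INDEX, "blue widgets")))  # ['a.html']
```

Note that search never looks at PAGES: once the index is built, a query is answered from the index alone.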

One of the things that a search engine algorithm scans for is the frequency and location of keywords on a web page, but it can also detect artificial keyword stuffing, or spamdexing. The algorithms then analyze the way that pages link to other pages on the Web. By checking how pages link to each other, an engine can determine what a page is about, for instance when the keywords of the linked pages are similar to the keywords on the original page. Most of the top-ranked search engines are crawler-based, while some are based on human-compiled directories. The people behind the search engines want the same thing every webmaster wants - traffic to their site. Since their content is mainly links to other sites, the way to get it is to make their search engine bring up the sites most relevant to the search query and display the best of these results first. To accomplish this, they use complex sets of rules called algorithms. When a search query is submitted to a search engine, these algorithms determine which sites are relevant to the query and then rank them so that the best matches come first.

Search engines keep their algorithms secret and change them often in order to prevent webmasters from manipulating their databases and dominating search results. They also want to provide new sites at the top of the search results on a regular basis rather than always having the same old sites show up month after month. An important difference to realize is that search engines and directories are not the same. Search engines use a spider to "crawl" the web and the web sites they find, as well as submitted sites. As they crawl the web, they gather the information that is used by their algorithms in order to rank your site.

Directories rely on submissions from webmasters, with live humans viewing your site to determine if it will be accepted. If accepted, directories often rank sites in alphanumeric order, with paid listings sometimes on top. Some search engines also place paid listings at the top, so it's not always possible to get a ranking in the top three or more places unless you're willing to pay for it.

Let us now look at a more detailed explanation of how search engines work. Crawler-based search engines are primarily composed of three parts: the spider, the index, and the search engine software.

A search engine robot's action is called spidering, as it resembles a multi-legged spider. The spider's job is to go to a web page, read the contents, connect to any other pages on that web site through links, and bring back the information. From one page it travels to several pages, and this proliferation follows several parallel and nested paths simultaneously. Spiders return to a site at some interval, perhaps every month to every few months, and re-index the pages, so that any changes to your pages are reflected in the index. The spiders automatically visit your web pages and create their listings. An important aspect is to study which factors promote a "deep crawl" - the depth to which the spider will go into your website from the page it first visited. Listing (submitting or registering) with a search engine is a step that can accelerate and increase the chances of that engine "spidering" your pages.
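Crawl depth can be pictured as a breadth-first walk that stops following links after a fixed number of hops. The sketch below is an assumption-laden illustration: the site map SITE_LINKS, the function crawl and the max_depth limit are invented here, and actual engines decide crawl depth by their own undisclosed rules.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/products", "/about"],
    "/products": ["/products/widgets"],
    "/products/widgets": ["/products/widgets/blue"],
    "/products/widgets/blue": [],
    "/about": ["/"],
}


def crawl(start, max_depth):
    """Visit pages breadth-first, going at most max_depth links from start.

    Returns a dict of page -> depth at which the spider reached it.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = seen[page]
        if depth == max_depth:
            continue  # deep enough: do not follow this page's links
        for link in SITE_LINKS.get(page, []):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen


print(crawl("/", max_depth=2))
```

With max_depth=2 the page "/products/widgets/blue" is never reached, because it sits three links away from the starting page. Pages buried that deep are the ones a shallow crawl would miss.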

As the spider moves across web pages it stores those pages in its memory, but the key action is indexing. The index is a huge database containing all the information brought back by the spider, and it is constantly updated as the spider collects more information. The entire page is not necessarily indexed, and the searching and page-ranking algorithm is applied only to the index that has been created. Most search engines claim that they index the full visible body text of a page. In a subsequent section, we explain the key considerations to ensure that the indexing of your web pages improves relevance during search. A combined understanding of the indexing and page-ranking processes will help you develop the right strategies. The Meta tags 'Description' and 'Keywords' have a vital role, as they are indexed in a specific way. Some of the top search engines do not index keywords that they consider spam. They will also not index certain 'stop words' (commonly used words such as 'a', 'the' or 'of') so as to save space or speed up the process. Images themselves are not indexed, but image descriptions, Alt text, or "text within comments" is included in the index by some search engines.
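The effect of stop words on the index can be shown with a short sketch. The stop-word list, the tokenize helper and the add_to_index function are assumptions made for illustration; each search engine keeps its own list and its own index layout.

```python
import re
from collections import defaultdict

# A small, hypothetical stop-word list; real lists differ per engine.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in"}


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def add_to_index(index, url, text):
    """Record (url, position) for every word that is not a stop word."""
    for position, word in enumerate(tokenize(text)):
        if word in STOP_WORDS:
            continue  # skipped to save space and speed up searching
        index[word].append((url, position))


index = defaultdict(list)
add_to_index(index, "a.html", "The history of the blue widget")
print(dict(index))
```

Only 'history', 'blue' and 'widget' end up in the index. Positions are kept as well, since the location of a keyword on the page matters for ranking.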

The search engine software or program is the final part. When a person requests a search on a keyword or phrase, the search engine software searches the index for relevant information. The software then provides a report back to the searcher with the most relevant web pages listed first. The algorithm-based processes used to determine ranking of results are discussed in greater detail later.
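As a rough sketch of this final part, the snippet below scores each indexed page by how often the query words occur in it and lists the best matches first. The data in PAGE_WORD_COUNTS and the function ranked_results are hypothetical, and counting words is only a stand-in for the much richer ranking rules real engines use.

```python
# Hypothetical index: for each page, how often each word occurs on it.
PAGE_WORD_COUNTS = {
    "a.html": {"blue": 2, "widgets": 5},
    "b.html": {"red": 1, "widgets": 1},
    "c.html": {"gadgets": 4},
}


def ranked_results(query):
    """Return matching pages, most relevant first."""
    words = query.lower().split()
    scored = []
    for url, counts in PAGE_WORD_COUNTS.items():
        score = sum(counts.get(word, 0) for word in words)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties are broken alphabetically by page name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for score, url in scored]


print(ranked_results("blue widgets"))  # ['a.html', 'b.html']
```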

Human-powered directories are the second type of search engine. They compile listings of websites into specific industry and subject categories, and they usually carry a short description of each website. Inclusion in directories is a human task and requires submission to the directory producers. Visitors and researchers on the net quite often use these directories to locate relevant sites and information sources. Thus directories assist in structured search. Another important reason to be listed is that crawler engines quite often find websites to crawl through their listings and links in directories. Yahoo and The Open Directory are amongst the largest and best-known directories. LookSmart is a directory that provides results to partner sites such as MSN Search, Excite and others. Lycos is an example of a site that pioneered the search engine but shifted to the directory model, depending on AlltheWeb.com for its listings.

Hybrid search engines are both crawler-based and human-powered. In plain words, these search engines have two sets of listings, one from each of the mechanisms mentioned above. The best example of a hybrid search engine is Yahoo, which has a human-powered directory as well as a search toolbar administered by Google. Although such engines provide both kinds of listings, they are generally dominated by one of the two mechanisms. Yahoo is known more for its directory than for its crawler-based search engine.

Search engines rank web pages according to the software's understanding of the web page's relevancy to the term being searched. To determine relevancy, each search engine follows its own group of rules. The most important rules are:

- The location of keywords on your web page; and
- How often those keywords appear on the page (the frequency)

For example, if the keyword appears in the title of the page, then it would be considered to be far more relevant than the keyword appearing in the text at the bottom of the page. Search engines consider keywords to be more relevant if they appear sooner on the page (like in the headline) rather than later. The idea is that you'll be putting the most important words - the ones that really have the relevant information - on the page first.
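The location rule can be made concrete with a toy scoring function. Everything here is an assumption chosen for illustration: the function location_score, the weight of 3.0 for a title match, and the linear bonus for appearing early in the body are not taken from any real search engine.

```python
def location_score(keyword, title, body):
    """Score a page higher when the keyword is in the title or early in the body."""
    keyword = keyword.lower()
    score = 0.0
    if keyword in title.lower().split():
        score += 3.0  # assumed weight: a title match counts the most
    words = body.lower().split()
    if keyword in words:
        first = words.index(keyword)
        # 1.0 if the keyword is the first word, approaching 0 toward the end.
        score += 1.0 - first / len(words)
    return score


early = location_score("widgets", "Blue Widgets",
                       "widgets for every budget and taste")
late = location_score("widgets", "Our Shop",
                      "for every budget and taste we sell widgets")
print(early, late)  # 4.0 0.125
```

The first page has the keyword in its title and as the first word of its text, so it scores far higher than the second page, where the keyword appears only at the very end.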

Search engines also consider the frequency with which keywords appear. The frequency is usually determined by how often the keywords are used out of all the words on a page. If the keyword is used 4 times out of 100 words, the frequency would be 4%. Of course, you can now develop the perfect relevant page with one keyword at 100% frequency - just put a single word on the page and make it the title of the page as well. Unfortunately, the search engines don't make things that simple.
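The frequency calculation described above is simple enough to write down directly. The function name keyword_frequency and the sample text are made up for this sketch.

```python
def keyword_frequency(keyword, text):
    """Return how often keyword occurs, as a percentage of all words in text."""
    words = text.lower().split()
    if not words:
        return 0.0
    return 100.0 * words.count(keyword.lower()) / len(words)


# A 100-word sample in which the keyword appears 4 times.
sample = " ".join(["widget"] * 4 + ["filler"] * 96)
print(keyword_frequency("widget", sample))    # 4.0
print(keyword_frequency("widget", "widget"))  # 100.0
```

The second call is the "perfect" one-word page from the paragraph above: its frequency is 100%, which is exactly why engines do not rely on frequency alone.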

While all search engines follow the same basic rules of relevancy, location and frequency, each search engine has its own special way of determining rankings. To make things more interesting, the search engines change the rules from time to time, so that rankings change even if the web pages have remained the same. One method of determining relevancy used by some search engines (like HotBot and Infoseek), but not others (like Lycos), is the Meta tags. Meta tags are hidden HTML codes that provide the search engine spiders with potentially important information like the page description and the page keywords.

Meta tags are often labeled as the secret to getting high rankings, but Meta tags alone will not get you a top 10 ranking. On the other hand, they certainly don't hurt. Detailed information on meta-tags and other ways of improving search engine ranking is given later in this chapter.

In the early days of the web, webmasters would repeat a keyword hundreds of times in the Meta tags and then add it hundreds of times to the text on the web page by making it the same color as the background. Today, however, major search engines have algorithms that may exclude a page from ranking if it has resorted to "keyword spamming"; some search engines will even downgrade the ranking in such cases and penalize the page.
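One crude way to picture such a check is to flag a page when a single word makes up too large a share of its text. The function looks_stuffed and its thresholds (max_share and min_words) are assumptions invented for this sketch; search engines do not publish how they actually detect keyword spamming.

```python
from collections import Counter


def looks_stuffed(text, max_share=0.2, min_words=20):
    """Flag text in which one word exceeds max_share of all words.

    Very short texts are never flagged, since a few words prove little.
    """
    words = text.lower().split()
    if len(words) < min_words:
        return False
    word, count = Counter(words).most_common(1)[0]
    return count / len(words) > max_share


stuffed = "cheap widgets " + "widgets " * 40
normal = ("we build and repair blue widgets for workshops schools and "
          "hobbyists and we ship every order within two working days of payment")

print(looks_stuffed(stuffed))  # True
print(looks_stuffed(normal))   # False
```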

Link analysis and 'clickthrough' measurement are certain other factors that are "off the page" and yet crucial in the ranking mechanism adopted by some leading search engines. This is quickly emerging as the most important determinant of ranking, but before we study this, we must first look at the most popular search engines and then look at the various steps you can take to improve your success at each of the stages - spidering, indexing and ranking.
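To give a feel for link analysis, the sketch below scores pages by repeatedly passing each page's score along its outgoing links, so that pages with many incoming links end up with higher scores. It is a simplified textbook-style iteration, not the algorithm of any particular search engine; the function link_scores, the damping value of 0.85, and the four-page link map are all assumptions for illustration.

```python
def link_scores(links, damping=0.85, rounds=50):
    """Score pages by the links pointing at them.

    links maps every page to the list of pages it links to; every
    link target must itself be a key of links.
    """
    pages = list(links)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(rounds):
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if not targets:
                # A page with no outgoing links shares its score with all pages.
                targets = pages
            share = damping * scores[page] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical link map: three pages link to "home", nothing links to "orphan".
LINKS = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home"],
    "orphan": ["home"],
}

scores = link_scores(LINKS)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this small example "home" comes out on top because every other page links to it, while "orphan" comes last because no page links to it, whatever keywords it may contain.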

Google is a privately held company that was founded in 1998 by two Stanford graduates, Larry Page and Sergey Brin. Dr. Eric Schmidt, the CEO, joined in 2001, and by the end of that year the company had shown a profit.

Yahoo has shown sales of around $225 million in the second quarter this year (inching close to the $1 billion mark over the year), and a net income of $21.4 million. Yahoo is a portal and not a pure-play search engine company. It was founded by Jerry Yang and David Filo.

Google is the search engine that powers the search directory for Yahoo. This partnership started in 2000, and a recent report said that the contract is being extended. Last year, Yahoo paid Google about $7.2 million for Web search services. Inktomi has also been a contender for Yahoo's business. Google also provides an Apple-specific search engine tailored to deliver highly targeted results related to Apple Computer and the Macintosh computing platform.

The Apple-specific search engine, located at www.google.com/mac.html, makes searching for everything from Apple's corporate information to product-related news faster and easier.

Inktomi has a robust networking business and a foothold in enterprise search. However, it recently posted deep losses. The company reported a wider net loss in the second quarter 2002, with lower revenue. Its loss broadened to $104 million or 72 cents a share, from $58.3 million, or 46 cents a share, a year earlier. Revenue fell to $30.8 million from $39.5 million a year earlier.

To stay healthy and competitive in consumer search, Inktomi last year introduced a program that generates fees from Web sites listed in its database. Inktomi charges companies such as Amazon.com and eBay to list more than 1,000 Web addresses; they might pay anywhere from 5 cents to 40 cents per click when Web surfers jump to their pages from Inktomi's database. The revenue generated from paid inclusion is shared with partners such as MSN and Overture.

For March 2003, according to a study by Jupiter Media Metrix, there were an estimated 114 million Internet users online in the US at work or at home, 80 percent of whom are estimated to have made some type of search request during the month.

By Ken Mathie


