October 3, 2000

Searching and browsing for content are probably the most common tasks performed on the Web. Consequently, search engines and directories have become the most popular Websites today.

However, despite the great value these sites add, service providers face two problems. First, the rapid growth of the Web keeps search engines on their toes as they struggle to scale up hardware and software. According to a recent survey, there are more than one billion Web pages online today, amounting to more than 10 terabytes of text, yet most traditional search engines can cover only a fraction of them. Moreover, an increasing amount of information is hidden behind search forms, or stored in databases that search engines cannot access directly. The second, and more serious, problem is the difficulty of matching your needs with the information available on the Web.

In this series of articles, we introduce you to technologies that are making their way into Web-information management products. This article starts with traditional search technologies and moves on to some popular search engines.

Traditional search technologies

Traditionally, the process of information retrieval was restricted to static data present either in the form of a table, a file, or at the most, a collection of data files. Data was nothing more than a collection of records, each of them associated with a unique key. A search algorithm would accept an argument “a” and try to find a record whose key was “a”. Decades of research in information retrieval resulted in stable solutions like Verity, AltaVista, Fulcrum, ZyLAB, PLS, Open-Text and Lexis-Nexis.
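
The keyed lookup described above can be sketched in a few lines of Python. This is only an illustration: the record keys and values are made up, and a binary search over sorted keys is just one of several ways such a lookup could be done.

```python
from bisect import bisect_left

# Hypothetical records, kept sorted by their unique key.
RECORDS = [(101, "alpha"), (205, "beta"), (309, "gamma")]
KEYS = [key for key, _ in RECORDS]

def find_record(a):
    """Return the record whose key equals a, or None if there is none."""
    i = bisect_left(KEYS, a)  # binary search for the leftmost position of a
    if i < len(KEYS) and KEYS[i] == a:
        return RECORDS[i]
    return None

print(find_record(205))
print(find_record(999))
```

The point to notice is that the search succeeds only on an exact key match, which is what separates this kind of retrieval from the relevance-ranked searches discussed later.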

Researchers at McGill University, Montreal, developed a very early Internet search engine–called Archie–in 1990. Archie searches the files on Internet FTP servers. Two more search tools, built for Gopher servers–Veronica in 1992 and Jughead in 1993–followed Archie.

In the case of traditional search technologies, you enter a keyword, or a key phrase (keywords along with Boolean modifiers, such as “and”, “or”, “not”) into a search service, which then scans an index of Web pages for the keywords. To determine the order in which to display pages, the engine uses an algorithm to rank sites that contain the keyword(s). For example, the engine may count the number of times the keyword appears on a page. Or it may look for keywords in metatags. A metatag is an HTML tag that provides information about a Web page.
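
To make the idea concrete, here is a minimal Python sketch of a keyword index with “and”/“or” matching and ranking by keyword count. The three pages, their URLs, and the function names are invented for illustration; real engines use far more elaborate scoring.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus: page URL -> page text.
PAGES = {
    "a.example/cricket": "cricket scores and cricket news",
    "b.example/sports": "sports news cricket football",
    "c.example/physics": "quantum computing news",
}

def build_index(pages):
    """Map each word to a Counter of {url: times the word occurs on that page}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] += 1
    return index

def search(index, keywords, mode="and"):
    """Return matching URLs, ranked by the total number of keyword occurrences."""
    postings = [index.get(word.lower(), Counter()) for word in keywords]
    if not postings:
        return []
    url_sets = [set(p) for p in postings]
    if mode == "and":
        matches = set.intersection(*url_sets)  # pages containing every keyword
    else:
        matches = set.union(*url_sets)         # pages containing any keyword
    scores = {url: sum(p[url] for p in postings) for url in matches}
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

index = build_index(PAGES)
print(search(index, ["cricket", "news"]))
print(search(index, ["quantum", "cricket"], mode="or"))
```

Because the score is a plain count, a page that simply repeats a keyword many times rises to the top, which is the weakness that ranking tricks exploit.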

Directories and search engines

The pioneering efforts of Archie and friends were rapidly followed by serious commercial efforts. We can classify the services available today into search engines and directories, although the distinction between the two is getting blurred.

Directories
Yahoo is the earliest and best-known Web directory. The greatest value added by directories is the hierarchy of topics, typically a tree with cross-links, with a large number of relevant pages and sites. Often, directories do not use a crawler to gather data. Instead, they maintain manually produced catalogs, which are searched in response to user queries. Directories are built from descriptions of Web pages, either submitted to the directory or written by editors employed by the directory who have reviewed the pages.

Owing to the need for human input, directories cannot keep track of the latest changes in pages as well as search engines can. But proponents claim that the value added by cataloging and editorial inputs compensates for the narrower coverage.

Since the success of Yahoo, directories have been added to almost every search portal. The Open Directory Project (dmoz.org) was launched recently with a distributed editor base, to try to keep up with the growth of the Web and the diversification of topics.

Search engines
Search engines broadly consist of three components–the crawler, the index, and the search algorithm. Crawlers, also called spiders, are programs that automatically scan Websites and create indexes of URLs, keywords, and links. How Websites (or pages) are selected depends on how the crawler is set up. In many cases, crawlers are allowed to follow the links on a page to find other relevant pages. Since Web pages keep changing, the crawler returns to the site periodically to look for changes.
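
A link-following crawler can be sketched as a breadth-first traversal. The Python example below runs over a small in-memory “Web” rather than the network, so every URL and page in it is hypothetical, and the `crawl` function and its `fetch` parameter are names chosen for this sketch only.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
WEB = {
    "site.example/": ("home page", ["site.example/a", "site.example/b"]),
    "site.example/a": ("page about search", ["site.example/b"]),
    "site.example/b": ("page about crawlers", ["site.example/", "site.example/c"]),
    "site.example/c": ("archive page", []),
    "site.example/hidden": ("no page links here", []),
}

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns {url: (keywords, links)}."""
    index = {}
    queue = deque([seed])
    seen = {seed}  # URLs already queued, so that no page is fetched twice
    while queue and len(index) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:  # unreachable or missing page
            continue
        text, links = page
        index[url] = (text.split(), list(links))
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl("site.example/", WEB.get)
print(sorted(index))
```

Note that `site.example/hidden` never enters the index, because no crawled page links to it. What a crawler finds depends on where it starts and on which links exist.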

When a user submits a search query, software implementing the engine’s algorithm goes through the index to find Web pages with keyword matches and ranks the pages in terms of relevance.

The key here is “relevance”. We’ll talk about this next time, but it is worthwhile to mention here a couple of problems with this type of search. The starting point of the crawl biases the end results, which tend to favor pages on topics of mass interest. So, for instance, sports are covered more extensively than quantum computing. Web pages purposely submitted to search engines can also gain prominence. Websites also use a variety of tricks to boost their search rankings, including filling their pages with strings of repeated keywords in a color that makes them invisible to the viewer, or embedding keywords in the HTML code that underlies a page.

Search engines today

Meta crawlers
Meta-search services like dogpile.com or metacrawler.com submit a query to a number of directory services and indexed search engines and provide the top results from each to the user. However, such meta-search services are seldom an effective way to improve coverage of the Web. While such sites offer more comprehensive coverage than any single search engine, if you must search far and wide, it’s better to search at each engine individually. This is because meta-search services often cut off queries before they complete the search of an index in order to increase their speed. Also, they often provide inadequate translation between the query formats required by different engines. So, the results may not reflect the data actually available.
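
The merging step of a meta-search service might look roughly like the Python sketch below. The two “engines” are hypothetical stand-in functions returning fixed lists, and round-robin interleaving with duplicate removal is an assumed merging policy, not the documented behavior of any particular service.

```python
from itertools import zip_longest

def meta_search(query, engines, top_n=3):
    """Query every engine, keep each one's top_n hits, and interleave them."""
    result_lists = [engine(query)[:top_n] for engine in engines]
    merged, seen = [], set()
    # Take the first hit from each engine, then the second from each, and so on.
    for rank_slice in zip_longest(*result_lists):
        for url in rank_slice:
            if url is not None and url not in seen:  # skip padding and duplicates
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical stand-ins for real search services.
def engine_one(query):
    return ["x.example", "y.example", "z.example", "w.example"]

def engine_two(query):
    return ["y.example", "v.example"]

print(meta_search("web search", [engine_one, engine_two], top_n=3))
```

The `top_n` cut-off drops `w.example` even though one engine found it, a small-scale version of the truncation problem described above.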

Surfwax (www.surfwax.com) takes meta crawlers a step further. Developed using technologies that define words and word relationships, Surfwax improves upon existing meta searchers. When you enter a query in the search box, two frames appear below it. Loading first, the left-hand frame contains results from various search services including FAST (Fast Search & Transfer) and Google. What distinguishes Surfwax is a small, light green icon that sometimes appears to the left of linked hits. Clicking on this icon loads abstracts, key points and buzzwords from the matching resource in the right-hand frame. You can then select one or more of the buzzwords to refine your search.

FAST
FAST (www.fast.no) uses several approaches to make its search service (www.alltheweb.com) faster. In terms of the number of unique URLs maintained, FAST places itself ahead of Northern Light (www.northernlight.com), which many industry experts continued to recognize as the largest Web search engine, and which garnered accolades and awards throughout 1999. At alltheweb.com, because of the scalability of the architecture, the average response time for an advanced search is under a second, compared to an industry average of four to four-and-a-half seconds.

The company credits its search service’s performance in part to fast indexing algorithms; large arrays of off-the-shelf servers, storage systems, and interconnects; and software that uses server capabilities efficiently. For sheer scope of coverage, not to mention processing speed (as its name implies), FAST takes the lead. But anyone who searches FAST will soon discover its weakness–relevance. For example, a search for “Science Magazine” returns links for Science Fiction, Fantasy, Weird Fiction Magazine Index, Science Humor Magazine, and other unrelated items. A search for the laws of Northern Ireland finds news items, a white paper, and constitutional proposals within the first page of hits, but not Her Majesty’s Stationery Office, which publishes these laws. Clearly, FAST requires further work to be of good use to professional researchers.

Direct Hit
Believing that popularity is a relevant criterion, Direct Hit (www.directhit.com) monitors which Websites searchers select from a list of results. It factors in statistics like time spent at a selected site and then applies this information to refine the engine’s index. If searchers frequently select site X in response to query Y, Direct Hit boosts site X’s relevancy ranking for that query. Direct Hit calls its technology the Popularity Engine. Unfortunately, success with Direct Hit depends on the choices of earlier searchers, who may or may not have the same information needs–including subject familiarity and intended use–as you.
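
A toy model of popularity-based re-ranking is sketched below in Python. It is not Direct Hit’s actual method: the `PopularityRanker` class, the use of accumulated seconds as the only signal, and the sample queries and sites are all assumptions made for illustration.

```python
from collections import defaultdict

class PopularityRanker:
    """Re-ranks results using past clicks; a toy model only."""

    def __init__(self):
        # (query, url) -> total seconds earlier searchers spent on that site
        self.dwell = defaultdict(float)

    def record_click(self, query, url, seconds_on_site):
        self.dwell[(query, url)] += seconds_on_site

    def rerank(self, query, results):
        # sorted() is stable, so sites nobody clicked keep their original order.
        return sorted(results, key=lambda url: -self.dwell.get((query, url), 0.0))

ranker = PopularityRanker()
ranker.record_click("jaguar", "cars.example", 120)
ranker.record_click("jaguar", "cars.example", 90)
ranker.record_click("jaguar", "zoo.example", 30)
print(ranker.rerank("jaguar", ["zoo.example", "wiki.example", "cars.example"]))
```

A searcher looking for the animal is now shown the car site first, simply because earlier searchers preferred it. That is the dependence on other people’s needs noted above.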

Engines based on natural language

“What is nature?” would seem to be about as vague a question as you can get. But put it to Ask Jeeves (www.askjeeves.com), a popular Internet search engine that claims to perform natural-language analysis, and its preferred response is: “Nature, international weekly journal of science”. Search with other leading engines using the single keyword “nature”, and Nature or its sister journals appear in the top-ten returns.

These choices reflect popular association of the word with the publication, Nature, but fail to recognize that the original question, interpreted without that baggage, could rightfully evoke responses involving ecology.

Ask Jeeves and some other search services, such as ElectricMonk, claim to perform natural-language analysis, but evidence of such processing is rare. For example, confronted with the query “Has the JTextField focus bug in Swing been fixed?”, a question that a Java programmer may ask, Ask Jeeves responds with the following related questions:

  • Where can I learn about controlling insect pests on my lawn?

  • Where can I learn about the insect or arachnid?

  • Where can I find information on the new Ford Focus?

  • Where can I find music resources for Swing music?

Obviously, these are all wide of the mark.

There are well-reputed research projects and products, such as CYC and WordNet, which approach natural-language analysis more seriously, but such search systems are not publicly available, and most likely don’t have the millisecond-scale response times expected of search engines. If you are interested in exploring this area, you can check out the following Websites.

Emerging technologies

Many new techniques are emerging to analyze this web of information and to extract useful patterns and discoveries. Let’s explore social-network analysis, an emerging technique that promises to influence the future of Web search and discovery.

Some of the best-known recent innovations in search-engine technology take their inspiration from the analysis of social networks. Conventional search engines rank query responses based on the frequency of keywords specified in a query. A new breed of engines, best exemplified by Google (www.google.com), exploits the links between Web pages. Roughly speaking, pages with many links pointing to them–akin to highly cited papers–are considered as “authorities”, and are ranked highest in search returns.
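
One way to turn “many links pointing to them” into a score is an iterative calculation in which each page passes a share of its own score along its outgoing links. The Python sketch below is a textbook-style simplification of link-based ranking, not Google’s production algorithm. The four-page graph, the function name `link_rank`, and the damping value of 0.85 are assumptions made for this example.

```python
def link_rank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}; scores sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page receives a small baseline score regardless of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "c", and "c" links back to "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_rank(LINKS)
print(max(scores, key=scores.get))
```

Page `c` comes out on top because it has the most incoming links, and `a` follows closely because its single incoming link comes from the highly ranked `c`. A link from an important page counts for more than a link from an obscure one.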

Google’s algorithms rank Web pages by analyzing their hyperlinks in a series of iterative cycles. Whereas most search engines, including the ones discussed earlier, only associate the text of a link with the page the link is on, Google associates it with the page the link points to. This allows it to cover many more pages than it actually crawls, even yielding links to sites that bar search engines’ crawler programs. Clever, a prototype search engine developed at IBM Almaden, takes the citation analogy further. Like Google, it produces a ranking of authorities, but it also generates a list of “hubs”–pages that have many links to authorities. Hubs are akin to review articles, which cite many top-rated papers. Those that link to many of the most highly cited authorities are given higher rankings than those that link to less popular sites. Users get not only the top hits but also hubs, which provide a good starting point for browsing.
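
The mutual reinforcement between hubs and authorities can be sketched with the iteration usually known as HITS, on which Clever is generally described as building. The Python code below is a bare-bones illustration rather than Clever’s implementation. The page names and the fixed count of 20 iterations are assumptions.

```python
import math

def hubs_and_authorities(links, iterations=20):
    """links: {page: [pages it links to]}. Returns (hub scores, authority scores)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: two "review" pages and a blog citing three "papers".
LINKS = {
    "review1": ["paperA", "paperB", "paperC"],
    "review2": ["paperA", "paperB"],
    "blog": ["paperA"],
}
hub, auth = hubs_and_authorities(LINKS)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

Here `paperA`, cited by all three pages, becomes the top authority, while `review1`, which links to every paper, becomes the top hub, much like a review article that cites all the key papers.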

In this piece, we took a tour of the prominent Web-based search engines and the technologies that power them. In the next issue, we’ll take a deeper look into social-network analysis of the Internet. We’ll also explore the algorithms and techniques used in such analyses.

Guru Shyam, with inputs from Prof Soumen Chakrabarty
