Tech Explained

Filtering Focused Information

PCQ Bureau

10 Nov 2000 21:50 IST

New Update

This article is in continuation of last month’s piece onsearch engines (page 71). There we discussed two issues–first, that even thelargest of engines can’t keep pace with the stupendous growth of the Web, andsecond, today’s one-size-fits-all approach to search must yield touser-adaptive searches, which learn from the past behavior of users.

In this article, we’ll take a look at technologies behind,what can be termed as, second-generation search engines–engines that exploitthe so-called social network of the World Wide Web. These technologies use acombination of statistics, pattern recognition, machine learning, and artificialintelligence to analyze sources of information and extract useful patterns fromthem.

Filtering information

A common approach for filtering information on the Net usestwo strategies: filter by relevance and filter by quality or popularity.Distilling a general topic to a size that will make sense to a user involvesidentifying the most definitive or authoritative Web pages on that topic. Thisis done by locating not only a set of relevant pages, but also those pages thatare of the highest quality. Relevance is handled, to an extent, by keywordmatching.

Another approach, that of tapping into the social network ofthe World Wide Web, can also derive notions of authority. The social network canbe accessed through hyperlinks that contain enormous amounts of latent humanannotation. Specifically, the creation of a hyperlink by the author of a Webpage represents an implicit endorsement of the page. Looking at suchendorsements collectively can give a better understanding of the relevance andquality of the page’s contents.

Google and Clever are two search engines that use bothapproaches of filtering and social-network analysis to search the Internet.Google (www.google.com) was developed atStanford, while researchers at IBM began the development of the Clever system (www.almaden.ibm.com/cs/k53/clever.html).Google analyzes hyperlinks to uncover the best pages on the Internet on a giventopic. Clever goes a step ahead and generates good starting points for Webnavigation on a given topic. Clever gives two types of pages: authorities, whichprovide the best source of information on a given topic, and hubs, which providecollections of links to authorities.

For Google, the measure of authority of a page isproportional to the total authority of all the pages that cite it. We will focusan Clever, since Google can be regarded as a special implementation of Clever.

Authorities and hubs

Even though the Web’s link structure can reveal notions ofauthority, it’s not possible to apply text-based methods to collect manypotentially relevant pages, and then comb them for the most authoritative ones.For example, if we were to look for the Web’s main search engines, we woulderr badly if we searched only for "search engines". Although the setof pages containing this term is enormous, it doesn’t contain most of thenatural authorities we would expect to find, such as Yahoo!, Excite, Infoseekand AltaVista. Similarly we can’t expect Honda’s or Toyota’s home pages tocontain the words "Japanese automobile manufacturers", nor Microsoft’sor Lotus’ home pages to contain the words "software companies".

This difficulty arises mainly because many links lacksemantic content, that is, although most links represent the type of endorsementwe seek (for example, a software engineer whose home site links to Microsoft andLotus), others are created for reasons that have nothing to do with conferringauthority. Some links exist purely for navigational purposes: "Click hereto return to main menu". Others serve as paid advertisements: "Thevacation of your dreams is only a click away".

The question then arises: how do we model the way in whichauthority is conferred on the Web? Clearly, when commercial or competitiveinterests are at stake, most organizations will perceive no benefit from linkingdirectly to another one. For example, AltaVista, Excite, and Infoseek may all beauthorities for the topic "search engines", but will be unlikely toendorse one another directly.

If, as in the above example, the major search engines don’texplicitly describe themselves as authorities, how can we determine that theyare indeed the most authoritative pages on search engines? We can say that theyare authorities because many relatively anonymous pages, clearly relevant to"search engines", link to AltaVista, Excite, and Infoseek. Such"anonymous" pages are hubs that link to a collection of prominentsites on a common topic.

Hub pages appear in a variety of forms, ranging fromprofessionally-assembled resource lists on commercial sites to lists ofrecommended links on individual home pages. These pages need not be prominentthemselves, or even have any links pointing to them. Their distinguishingfeature is that they confer authority on a focused topic. In this way, theyactually form a symbiotic relationship with authorities. Thus, we can say that agood authority is a page pointed to many good hubs, while a good hub is a pagethat points to many good authorities.

This relationship between authorities and hubs is central toexploring link-based methods of search, automated compilation of high-qualityWeb resources, and discovery of cohesive Web communities.

HITS: Computing authorityand hub scores

The Hyperlink Induced Topic Search (HITS) algorithm by JonKleinberg, which is the backbone of Clever search, computes lists of authoritiesand hubs for Web search topics. Beginning with a search topic specified by oneor more query terms, HITS creates a focused sample of several thousand Web pageslikely to be rich in relevant authorities, and determines the estimated weightsof hub and authority.

Such a technique can uncover Web communities, defined by aspecific interest, which even a human-assisted search engine like Yahoo! mayoverlook.

Combining content with link information

Relying extensively on links when searching for authoritative pages does have advantages. However, ignoring textual content after assembling the root set can lead to difficulties. On narrowly focused topics, HITS frequently returns resources for a more general topic. For instance, on querying for skiing in Nebraska, an area that the Web does not have many resources on,HITS will generalize and provide information on Nebraska tourist information.

Since all links out of a hub page propagate the same weight,HITS sometimes drifts when hubs discuss multiple topics. For instance, a chemist’s home page may contain good links not only to chemistry resources, but also to resources on her hobbies and regional information for her hometown. In such cases, HITS will confer some of the "chemistry" authority on to authorities for her hobbies and hometown, deeming these to be authoritative pages for chemistry.

Frequently, many pages from a single website will take over a topic simply because several of the pages occur in the base set. Moreover, pagesfrom the same site often use the same HTML design template. So, in addition to the information they give on a query topic, they may all point to a single popular site that has little to do with the query topic. This inadvertent topic hijacking can give a site too large a share of the authority weight for the topic, regardless of the site’s relevance.

Google vs Clever

Google faces most of the above shortcomings. Clever, on the other hand, addresses these issues by extending the HITS algorithm to textual analysis. The biggest difference is that unlike HITS where all edges are the same, Clever makes some edges "thicker" than others by analyzing text on the source page. If the query matches words close to a hyperlink, that link becomes "thicker". This way Clever can identify meaningful links rather than chance or whimsical links. The weights of links have a direct influence on the hub and authority scores.

Google and Clever are, by-and-large, the only engines that exploit the social structure of the Internet. Google Web search first computes a score, called page-rank, for every page indexed. Given a query, Google then returns pages (the authoritative ones) containing the query terms, ranked in order of these pages’ page-ranks.

An important factor that differentiates Clever from Google is the time of computation of popularity, or the degree of authority. Whereas Clever has a query-time iterative mechanism, Google pre-computes the page-ranks.This makes Google faster and more practical, but at the same time more prone to reporting off-topic pages as compared to the Clever system.

Despite the sophisticated link analysis, Google seeks to answer any possible query, and must therefore crawl the entire Web, or as much of it as possible. Since Google computes popularity offline before, and without any regard to queries, the measure of authority may be significantly distorted by noisy links leading to off-topic pages. While fixing this by determining the relevant graph at query time, Clever is naÃ¯ve about the radius of expansion ofthe root set. There is nothing to suggest that great hubs and authorities lie within one link distance of relevant pages found by keyword search.

Conclusion

Given the content explosion on the Internet, searching the entire Web by keywords will soon be a thing of the past. Your personalized search needs will be met by dedicated search portals and focused crawling means.

Soumen Chakrabarti,
Assistant Professor, Department of Computer Science and Engineering, IIT, Powai,Mumbai, and H. Gurushyam, student of computer science, Delhi Institute ofTechnology,New Delhi

Stay connected with us through our social media channels for the latest updates and news!