Friday, November 21, 2008  
Filtering Focused Information

Continued from Page 2

Combining content with link information

Relying extensively on links when searching for authoritative pages does have advantages. However, ignoring textual content after assembling the root set can lead to difficulties. On narrowly focused topics, HITS frequently returns resources for a more general topic. For instance, on a query for skiing in Nebraska, a topic on which the Web has few resources, HITS will generalize and return Nebraska tourist information.

Since all links out of a hub page propagate the same weight, HITS sometimes drifts when hubs discuss multiple topics. For instance, a chemist’s home page may contain good links not only to chemistry resources, but also to resources on her hobbies and regional information about her hometown. In such cases, HITS will confer some of the "chemistry" authority on the pages about her hobbies and hometown, deeming these to be authoritative pages for chemistry.
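The drift described above can be reproduced on a toy graph. The following is a minimal sketch of plain HITS with uniform link weights, not the authors' implementation; the page names, the tiny edge list and the iteration count are all hypothetical and chosen only for illustration.

```python
import math

def normalize(scores):
    # Scale scores to unit Euclidean length so they stay bounded.
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(edges, iterations=50):
    """Plain HITS: every link carries the same weight."""
    nodes = {page for edge in edges for page in edge}
    hub = {page: 1.0 for page in nodes}
    auth = {page: 1.0 for page in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical base set: the chemist's home page mixes topics.
EDGES = [
    ("chemist_home", "chem_journal"),
    ("chemist_home", "chem_db"),
    ("chemist_home", "hiking_club"),
    ("chemist_home", "hometown_news"),
    ("chem_dept", "chem_journal"),
    ("chem_dept", "chem_db"),
]

hub, auth = hits(EDGES)
for page in sorted(auth, key=auth.get, reverse=True):
    print(f"{page:15s} authority={auth[page]:.3f}")
```

In this sketch the hiking and hometown pages end up with a non-zero authority score for the chemistry query purely because a good chemistry hub links to them.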

Frequently, many pages from a single website will take over a topic simply because several of the pages occur in the base set. Moreover, pages from the same site often use the same HTML design template. So, in addition to the information they give on a query topic, they may all point to a single popular site that has little to do with the query topic. This inadvertent topic hijacking can give a site too large a share of the authority weight for the topic, regardless of the site’s relevance.

Google vs Clever

Google faces most of the above shortcomings. Clever, on the other hand, addresses these issues by extending the HITS algorithm with textual analysis. The biggest difference is that, unlike HITS, where all edges are the same, Clever makes some edges "thicker" than others by analyzing the text on the source page. If the query matches words close to a hyperlink, that link becomes "thicker". This way Clever can identify meaningful links rather than chance or whimsical ones. The weights of the links directly influence the hub and authority scores.
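To make the idea of "thicker" edges concrete, here is a minimal sketch of HITS with per-link weights. It is not Clever's actual formula: the weighting rule (1 plus the number of query-term matches in the text near the link), the page names and the sample text snippets are all assumptions made for illustration.

```python
import math

def edge_weight(nearby_text, query):
    # Assumed rule: base weight 1, plus 1 per query-term match near the link.
    words = nearby_text.lower().split()
    terms = query.lower().split()
    return 1.0 + sum(words.count(term) for term in terms)

def normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def weighted_hits(links, query, iterations=50):
    """HITS in which each link's contribution is scaled by its weight."""
    weights = {(src, dst): edge_weight(text, query) for src, dst, text in links}
    nodes = {page for edge in weights for page in edge}
    hub = {page: 1.0 for page in nodes}
    auth = {page: 1.0 for page in nodes}
    for _ in range(iterations):
        new_auth = {page: 0.0 for page in nodes}
        for (src, dst), w in weights.items():
            new_auth[dst] += w * hub[src]
        new_hub = {page: 0.0 for page in nodes}
        for (src, dst), w in weights.items():
            new_hub[src] += w * new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical links as (source, target, text near the hyperlink).
LINKS = [
    ("chemist_home", "chem_journal", "chemistry journal articles"),
    ("chemist_home", "chem_db", "chemistry database"),
    ("chemist_home", "hiking_club", "weekend hiking club"),
    ("chemist_home", "hometown_news", "hometown news"),
    ("chem_dept", "chem_journal", "chemistry journal"),
    ("chem_dept", "chem_db", "chemistry database"),
]

_, weighted_auth = weighted_hits(LINKS, "chemistry")
_, uniform_auth = weighted_hits(LINKS, "")  # empty query: every weight is 1

weighted_ratio = weighted_auth["hiking_club"] / weighted_auth["chem_journal"]
uniform_ratio = uniform_auth["hiking_club"] / uniform_auth["chem_journal"]
print(f"off-topic share, uniform edges:  {uniform_ratio:.3f}")
print(f"off-topic share, weighted edges: {weighted_ratio:.3f}")
```

With the weighted edges, the off-topic hiking page receives a noticeably smaller share of authority relative to the chemistry journal than it does when all edges count equally.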

Google and Clever are, by and large, the only engines that exploit the social structure of the Internet. Google Web search first computes a score, called page-rank, for every page it indexes. Given a query, Google then returns the pages containing the query terms, ranked in order of their page-ranks, so that the most authoritative pages come first.

An important factor that differentiates Clever from Google is when popularity, or the degree of authority, is computed. Whereas Clever uses an iterative mechanism at query time, Google pre-computes the page-ranks. This makes Google faster and more practical, but at the same time more prone to reporting off-topic pages than the Clever system.
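The split between offline scoring and query-time lookup can be sketched as follows. This is a simplified illustration, not Google's implementation: the damping factor of 0.85, the handling of pages without outgoing links, the page names and the whole-word matching in `search` are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Offline step: score every page with no reference to any query."""
    nodes = set(links) | {dst for dsts in links.values() for dst in dsts}
    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in nodes}
        for page in nodes:
            out = links.get(page, [])
            if out:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(out)
                for dst in out:
                    new_rank[dst] += share
            else:
                # Assumed treatment of dangling pages: spread rank evenly.
                share = damping * rank[page] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank

def search(query, page_text, rank):
    """Query-time step: filter by query terms, sort by the stored ranks."""
    terms = query.lower().split()
    matches = [
        page for page, text in page_text.items()
        if all(term in text.lower().split() for term in terms)
    ]
    return sorted(matches, key=lambda page: rank[page], reverse=True)

# Hypothetical four-page Web.
LINKS = {
    "ski_blog": ["tourism_portal"],
    "gear_shop": ["tourism_portal"],
    "travel_forum": ["tourism_portal"],
    "tourism_portal": ["ski_blog"],
}
PAGE_TEXT = {
    "ski_blog": "skiing in nebraska",
    "gear_shop": "skiing gear",
    "travel_forum": "travel chat",
    "tourism_portal": "nebraska tourism",
}

RANK = pagerank(LINKS)                      # computed once, offline
results = search("nebraska", PAGE_TEXT, RANK)
print(results)
```

Because the ranks are fixed before any query arrives, answering a query needs only a filter and a sort, which is the speed advantage the article describes; it is also why a generally popular page can outrank a more specific one.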

Despite its sophisticated link analysis, Google seeks to answer any possible query, and must therefore crawl the entire Web, or as much of it as possible. Since Google computes popularity offline, in advance and without regard to any query, the measure of authority may be significantly distorted by noisy links leading to off-topic pages. Clever fixes this by determining the relevant graph at query time, but it is naïve about the radius of expansion of the root set. There is nothing to suggest that great hubs and authorities lie within one link of the relevant pages found by keyword search.

Conclusion

Given the content explosion on the Internet, searching the entire Web by keywords will soon be a thing of the past. Your personalized search needs will be met by dedicated search portals and focused crawling.

Soumen Chakrabarti, Assistant Professor, Department of Computer Science and Engineering, IIT, Powai, Mumbai, and H. Gurushyam, student of computer science, Delhi Institute of Technology, New Delhi

