New Technologies for the Web

PCQ Bureau

29 Nov 2000 10:35 IST

This article is the concluding piece of the series on

Web-information management. The first two articles in the series were on the

technologies that powered the first-generation search engines and how the

second-generation search engines exploit social-network analysis for

effective mining of relevant information. In this article we will talk about

focused crawling that promises to contribute to our information-foraging

endeavors. We will also look at another technology, Memex, that lets you use

your past surfing experiences to search for relevant information on the Web.

How focused crawling works

Focused crawling concentrates on the quality of information

and the ease of navigation rather than the sheer quantity of content on the

Web. A focused crawler seeks, acquires, indexes, and maintains pages on a

specific set of topics that represent a relatively narrow segment of the Web.

Thus, a distributed team of focused crawlers, each specializing in one or a few

topics, can manage the entire content of the Web.

Rather than collecting and indexing all accessible Web

documents to be able to answer all possible ad-hoc queries, a focused crawler

analyzes its crawl boundary to find the links that are likely to be most

relevant for the crawl, and avoids irrelevant regions of the Web. Focused

crawlers selectively seek out pages that are relevant to a pre-defined set of

topics. These pages will result in a personalized web within the World Wide Web.

Topics are specified to the console of the focus system using example

documents and pages (instead of keywords).
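
The idea of working outward from the crawl boundary can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy in-memory web, the names topic_from_examples, relevance, and focused_crawl, and the word-overlap score that stands in for a real classifier. The topic is derived from example documents rather than typed keywords, and links found on irrelevant pages are never followed.

```python
import heapq

# Hypothetical toy web: URL -> (page text, outgoing links). Not real data.
TOY_WEB = {
    "a": ("mountain biking trails and gear", ["b", "c"]),
    "b": ("bike repair and mountain trails", ["d"]),
    "c": ("celebrity gossip and movie news", ["e"]),
    "d": ("mountain bike race calendar", []),
    "e": ("more movie gossip", []),
}

STOPWORDS = {"and", "the", "of", "a", "for"}


def topic_from_examples(example_texts):
    """Build a topic vocabulary from example documents (assumed approach)."""
    words = set()
    for text in example_texts:
        words.update(w for w in text.lower().split() if w not in STOPWORDS)
    return words


def relevance(text, topic_words):
    """Stand-in classifier: fraction of the page's words that are topic words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_words for w in words) / len(words)


def focused_crawl(seeds, topic_words, web, threshold=0.2, max_pages=10):
    """Best-first crawl that expands the most promising boundary URL first
    and does not follow links out of pages judged irrelevant."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue
        text, links = web[url]
        score = relevance(text, topic_words)
        if score < threshold:
            continue  # prune: do not enter this region of the web
        collected.append((url, round(score, 2)))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return collected


topic = topic_from_examples(["mountain biking trails", "bike trails"])
pages = focused_crawl(["a"], topic, TOY_WEB)
print(pages)
```

On this toy graph the crawl collects pages a, b, and d. Page c is fetched and rejected, so page e, which is reachable only through c, is never requested at all.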

Such a way of functioning results in significant savings in

hardware and network resources, and yet achieves respectable coverage at a rapid

rate, simply because there is relatively little to do. Each focused crawler is

far more nimble in detecting changes to pages within its focus than a crawler

that crawls the entire Web.

The crawler is built upon two hypertext mining programs: a

classifier that evaluates the relevance of a hypertext document with respect to

the focus topics, and a distiller that identifies hypertext nodes that are great

access points to many relevant pages within a few links.
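
To make the distiller's role concrete, here is a rough sketch that scores each page by the number of relevant pages reachable from it within a few links. This simple reachability count is our own stand-in, not the method used in the actual system, and the link graph, the RELEVANT set, and the function name hub_scores are hypothetical.

```python
from collections import deque


def hub_scores(links, relevant, max_hops=2):
    """For each page, count the relevant pages reachable within max_hops links."""
    scores = {}
    for start in links:
        seen = {start}
        queue = deque([(start, 0)])
        count = 0
        while queue:
            page, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for nxt in links.get(page, []):
                if nxt not in seen:
                    seen.add(nxt)
                    if nxt in relevant:
                        count += 1
                    queue.append((nxt, hops + 1))
        scores[start] = count
    return scores


# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "hub": ["p1", "p2", "index"],
    "index": ["p3", "p4"],
    "p1": [],
    "p2": ["p1"],
    "p3": [],
    "p4": [],
    "lonely": ["p1"],
}
# Pages the classifier is assumed to have marked as relevant.
RELEVANT = {"p1", "p2", "p3", "p4"}

scores = hub_scores(LINKS, RELEVANT)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Here the page named hub scores highest because all four relevant pages lie within two links of it, even though hub itself is not in the relevant set. A crawler could use such scores to decide which pages to revisit and expand first.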

What focused crawlers can do

Here is what we found when we used focused crawling for many

varied topics at different levels of specificity.

Focused crawling acquires relevant pages steadily, while standard crawling (of the kind used in first-generation search engines) quickly loses its way, even though both start from the same root set.

Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations.

It can discover valuable resources that are dozens of links away from the start set, and at the same time carefully prune the millions of pages that may lie within the same radius. The result is a very effective solution for building high-quality collections of Web documents on specific topics, using modest desktop hardware.

Focused crawlers impose sufficient topical structure on the Web. As a result, apart from naïve topical search, powerful semi-structured query, analysis, and discovery are also enabled.

Getting isolated pages, rather than comprehensive sites, is a common problem with Web search. With focused crawlers, you can order sites according to the density of relevant pages found there. For example, you can find the top five sites specializing in mountain biking.

A focused crawler also detects cases of competition. For instance, it will take into account that the homepage of an auto manufacturer such as Honda is unlikely to contain a link to the homepage of a competitor such as Toyota.

Focused crawlers also identify regions of the Web that grow or change dramatically, as opposed to those that are relatively stable.
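
Ordering sites by the density of relevant pages, as in the mountain-biking example above, can be sketched in a few lines. The crawl results, the example hostnames, and the function name rank_sites are made up for illustration, and we assume each crawled page already carries a relevant or not-relevant judgment from the classifier.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def rank_sites(crawled, top_n=5, min_pages=1):
    """Rank sites by the fraction of their crawled pages judged relevant.

    crawled is an iterable of (url, is_relevant) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for url, is_relevant in crawled:
        site = urlsplit(url).netloc
        totals[site] += 1
        hits[site] += int(is_relevant)
    ranked = [
        (site, hits[site] / totals[site])
        for site in totals
        if totals[site] >= min_pages
    ]
    # Highest density first; ties broken alphabetically by site name.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical crawl output, not real data.
CRAWLED = [
    ("http://bikes.example/trails", True),
    ("http://bikes.example/gear", True),
    ("http://bikes.example/about", False),
    ("http://news.example/sports/biking", True),
    ("http://news.example/politics", False),
    ("http://news.example/weather", False),
    ("http://news.example/film", False),
]

ranked = rank_sites(CRAWLED)
for site, density in ranked:
    print(site, round(density, 2))
```

The specialist site comes out ahead of the general news site, which has only one relevant page among the four that were crawled. Raising min_pages would keep a site with a single lucky page from topping the list.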

The ability of focused crawlers to focus on a topical

sub-graph of the Web and to browse communities within that sub-graph will lead

to significantly improved Web resource discovery. On the other hand, the

one-size-fits-all philosophy of other search engines, like AltaVista and Inktomi,

means that they try to cater to every possible query that might be made on the

Web. Although such services are invaluable for their broad coverage, much of

the diverse content they return is of little relevance or quality.

Memex

Let’s see how surfing history and bookmarks of a community

of users can be exploited to search for information on the Web. Browsers discard

most information sought by users through clicking, unless the information is

deliberately bookmarked. Even deliberate bookmarks are stored in a passive and

isolated manner.

A browser-assistant prototype, called Memex, which addresses

this issue, is now in the final stages of development at IIT Mumbai. It will be

made available from http://memex.cse.iitb.ernet.in

by January 2001. Memex is a repository for both surfing history and bookmarks of

a community of users. It is designed as a browsing assistant for individuals and

groups with focused interests. It blurs the artificial distinction between

browsing history and deliberate bookmarks. The glut of data generated as a

result of Web browsing is analyzed in a number of ways at the individual and

community levels. It is indexed not only by keywords but also according to the

user’s view of topics, which lets the user recall topic-based browsing

contexts by asking questions such as:

What trails was I following when I was last surfing about classical music?

What are some popular pages related to my recent trail regarding cycling?

What was the URL I visited six months back regarding compiler optimization at Rice University?

What was the Web neighborhood I was surfing the last time I was looking for resources on classical music?

How is my ISP bill divided into access for work, travel, news, hobby and entertainment?

How does my bookmark folder structure map on to my organization?

In a hierarchy of organizations (say, by region), who are the people who share my interest in recreational cycling most closely and are not likely to be computer professionals?
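
A question such as "what trail was I following when I last surfed about classical music?" suggests a history log indexed by topic and time. The sketch below is only a guess at the idea, not Memex's actual design: the Visit record, the last_trail function, the 30-minute session gap, and the sample URLs are all assumptions, and each visit is taken to have a topic label already assigned.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Visit:
    url: str
    topic: str       # assumed to be assigned already, e.g. from bookmark folders
    when: datetime


def last_trail(history, topic, gap=timedelta(minutes=30)):
    """Return the URLs of the most recent browsing session on a topic, oldest first.

    A session is cut wherever two consecutive visits on the topic
    are more than `gap` apart.
    """
    visits = sorted((v for v in history if v.topic == topic), key=lambda v: v.when)
    if not visits:
        return []
    trail = [visits[-1]]
    for v in reversed(visits[:-1]):
        if trail[-1].when - v.when > gap:
            break  # an older session starts here
        trail.append(v)
    return [v.url for v in reversed(trail)]


# Hypothetical browsing history, not real data.
HISTORY = [
    Visit("http://music.example/bach", "classical music", datetime(2000, 11, 1, 9, 0)),
    Visit("http://music.example/mozart", "classical music", datetime(2000, 11, 20, 20, 0)),
    Visit("http://cycling.example/routes", "cycling", datetime(2000, 11, 20, 20, 5)),
    Visit("http://music.example/mozart/requiem", "classical music", datetime(2000, 11, 20, 20, 10)),
]

trail = last_trail(HISTORY, "classical music")
print(trail)
```

The result is the two Mozart pages from the latest session. The visit from early November belongs to an older session and is left out, and the cycling page is ignored because it carries a different topic label.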

Conclusion

Information foraging is not about keyword-level querying.

Tomorrow’s search needs will be more ad-hoc in nature. As this happens and the

Web evolves from a structural Web to a semantic Web, newer ideas and systems

will need to continue the process of innovation. Recent developments, together

with advances in natural language analysis, seem to be leading us in the right

direction.

Soumen Chakrabarti, assistant professor, Department of Computer Science and Engineering, IIT Mumbai, and H Guru shyam, BE Computer Science student, NSIT, New Delhi
