This article is the concluding piece of the series on Web-information management. The first two articles in the series covered the technologies that powered the first-generation search engines, and how the second-generation search engines exploit social-network analysis for effective mining of relevant information. In this article, we discuss focused crawling, which promises to contribute to our information-foraging endeavors. We will also look at another technology, Memex, which lets you use your past surfing experiences to search for relevant information on the Web.
How focused crawling works
Focused crawling concentrates on the quality of information and the ease of navigation rather than on the sheer quantity of content on the Web. A focused crawler seeks, acquires, indexes, and maintains pages on a specific set of topics that represent a relatively narrow segment of the Web. Thus, a distributed team of focused crawlers, each specializing in one or a few topics, can manage the entire content of the Web.
Rather than collecting and indexing all accessible Web documents so as to answer every possible ad-hoc query, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. Focused crawlers selectively seek out pages that are relevant to a pre-defined set of topics; together, these pages form a personalized web within the World Wide Web. Topics are specified, through the console of the focus system, using exemplary documents and pages rather than keywords.
Such a way of functioning results in significant savings in
hardware and network resources, and yet achieves respectable coverage at a rapid
rate, simply because there is relatively little to do. Each focused crawler is
far more nimble in detecting changes to pages within its focus than a crawler
that crawls the entire Web.
The crawler is built upon two hypertext-mining programs: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are good access points to many relevant pages within a few links.
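To make this division of labor concrete, here is a minimal sketch, in Python, of what such a crawl loop might look like. The relevance() function stands in for the classifier, and the priority queue over the crawl frontier plays the distiller's role only very crudely, by expanding the most promising pages first. The function names, the topic word list and the thresholds are illustrative assumptions, not part of the system described here.

```python
import heapq
import urllib.request
from html.parser import HTMLParser

def relevance(text):
    # Stand-in for a trained topic classifier: fraction of words that
    # match a small, illustrative vocabulary for the focus topic.
    TOPIC_WORDS = {"cycling", "bicycle", "mountain", "bike", "trail"}
    words = text.lower().split()
    return sum(w in TOPIC_WORDS for w in words) / (len(words) or 1)

class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

def focused_crawl(seed_urls, max_pages=100, threshold=0.01):
    # The frontier is a priority queue keyed on the relevance of the page
    # where each link was found (heapq is a min-heap, so scores are negated).
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen, collected = set(seed_urls), []

    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        score = relevance(html)
        if score < threshold:
            continue                      # prune irrelevant regions of the Web
        collected.append((url, score))
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:      # expand the crawl boundary
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return collected
```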
What focused crawlers can do
Here is what we found when we used focused crawling on a variety of topics at different levels of specificity.
- Focused crawling acquires relevant pages steadily, while standard crawling (like that used in first-generation search engines) quickly loses its way, even though both start from the same root set.
- Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations.
- It can discover valuable resources that are dozens of links away from the start set, and at the same time carefully prune the millions of pages that may lie within this same radius. The result is a very effective solution for building high-quality collections of Web documents on specific topics, using modest desktop hardware.
- Focused crawlers impose a useful topical structure on the Web. As a result, apart from naïve topical search, powerful semi-structured query, analysis, and discovery are also enabled.
- Getting isolated pages, rather than comprehensive sites, is a common problem with Web search. With focused crawlers, you can order sites according to the density of relevant pages found there; for example, you can find the top five sites specializing in mountain biking (a minimal sketch of such density-based ranking appears after this list).
- A focused crawler also detects cases of competition. For instance, it will take into account that the homepage of a particular auto manufacturer, say Honda, is unlikely to contain a link to the homepage of its competitor, say, Toyota.
- Focused crawlers also identify regions of the Web that grow or change dramatically, as opposed to those that are relatively stable.
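As an illustration of the density-based site ranking mentioned in the list above, the sketch below counts, for each site, the fraction of crawled pages judged relevant and returns the densest sites. It assumes a list of (url, score) pairs such as the one produced by the earlier crawl-loop sketch; the cutoff and function names are illustrative, not taken from any actual focused-crawler implementation.

```python
from collections import defaultdict
from urllib.parse import urlparse

def top_sites(collected, k=5, relevant_if=0.05):
    """Rank sites by the density of relevant pages found there.

    `collected` is a list of (url, score) pairs, e.g. the output of the
    focused_crawl() sketch above; `relevant_if` is an illustrative cutoff.
    """
    pages = defaultdict(int)      # pages crawled per site
    relevant = defaultdict(int)   # pages judged relevant per site
    for url, score in collected:
        site = urlparse(url).netloc
        pages[site] += 1
        if score >= relevant_if:
            relevant[site] += 1
    density = {site: relevant[site] / pages[site] for site in pages}
    return sorted(density, key=density.get, reverse=True)[:k]

# Example: the top five sites specialising in the focus topic.
# print(top_sites(focused_crawl(["http://example.com/"]), k=5))
```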
The ability of focused crawlers to focus on a topical
sub-graph of the Web and to browse communities within that sub-graph will lead
to significantly improved Web resource discovery. On the other hand, the
one-size-fits-all philosophy of other search engines, like AltaVista and Inktomi,
means that they try to cater to every possible query that might be made on the
Web. Although such services are invaluable for their broad coverage, the
resulting diversity of content is often of little relevance or quality.
Memex
Let’s see how the surfing history and bookmarks of a community of users can be exploited to search for information on the Web. Browsers discard most information sought by users through clicking, unless the information is deliberately bookmarked. Even deliberate bookmarks are stored in a passive and isolated manner.
A browser-assistant prototype called Memex, which addresses this issue, is now in the final stages of development at IIT Mumbai. It will be made available from http://memex.cse.iitb.ernet.in by January 2001. Memex is a repository for both the surfing history and the bookmarks of a community of users. It is designed as a browsing assistant for individuals and groups with focused interests, and it blurs the artificial distinction between browsing history and deliberate bookmarks. The glut of data generated by Web browsing is analyzed in a number of ways at the individual and community levels. It is indexed not only by keywords but also according to the user’s view of topics, which lets the user recall topic-based browsing contexts by asking questions such as:
- What trails was I following when I was last surfing about classical music?
- What are some popular pages related to my recent trail regarding cycling?
- What was the URL I visited six months back regarding compiler optimization at Rice University?
- What was the Web neighborhood I was surfing the last time I was looking for resources on classical music?
- How is my ISP bill divided into access for work, travel, news, hobby and entertainment?
- How does my bookmark folder structure map onto my organization?
- In a hierarchy of organizations (say, by region), who are the people who share my interest in recreational cycling most closely and are not likely to be computer professionals?
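Purely as an illustration of the kind of store such questions imply, and not as a description of Memex's actual design, the sketch below records page visits against the user's own topic labels and replays the trail followed under a given topic. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Visit:
    url: str
    topic: str                 # the user's own topic label, not a keyword
    when: datetime = field(default_factory=datetime.now)

class BrowsingLog:
    """Toy stand-in for a Memex-like store of surfing history."""

    def __init__(self):
        self.visits = []

    def record(self, url, topic):
        self.visits.append(Visit(url, topic))

    def trail(self, topic):
        """Return the pages visited under a topic, oldest first."""
        return [v.url for v in sorted(self.visits, key=lambda v: v.when)
                if v.topic == topic]

# Usage: every page visit is logged, not just deliberate bookmarks.
log = BrowsingLog()
log.record("http://example.com/ragas", "classical music")
log.record("http://example.com/mtb-trails", "cycling")
print(log.trail("classical music"))
```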
Conclusion
Information foraging is not about keyword-level querying. Tomorrow’s search needs will be more ad-hoc in nature. As this happens and the Web evolves from a structural Web to a semantic Web, newer ideas and systems will have to carry the process of innovation forward. Recent developments, together with advances in natural language analysis, seem to be leading us in the right direction.
Soumen Chakrabarti, assistant professor, Department of Computer Science and Engineering, IIT Mumbai, and H Guru Shyam, B E Computer Science student, NSIT, New Delhi