A common feature of websites is an inbuilt search facility for retrieving
data of the user's interest. Developers generally incorporate in their
websites the customized search APIs of popular search engines like Google,
Yahoo!, MS Live, Amazon, etc. These companies crawl the related websites and
provide a search facility over the documents of those websites as well as the
World Wide Web. It may also act as an advertisement for them through the
websites. As a matter of pride, many organizations would prefer to have their
own search engine embedded in the website.
A decade ago, search engines like AltaVista, Lycos, Yahoo and AskJeeves
(now ask.com) were popular. Later, Google with its sophisticated ranking
strategy ensured acceptable results for different types of user queries. But
Google's customized search is a paid service, and many sites may not be able
to afford the facility. Then different search engines like Cuil, Guruji, Khoj
and Terrier came up with their own ranking strategies on the web, supporting
multiple languages. Along with these developments, Open Source search engines
also emerged.
Nutch
Nutch is an Open Source search engine developed in Java on top of Lucene, which
itself is a free Open Source information retrieval library. Nutch can be deployed
in Internet or intranet environments and can be customized for building small or
large scale information retrieval systems supporting multiple languages.
Direct Hit!
Applies To: Website developers and information retrieval researchers
USP: Developing a search engine for text document retrieval
Primary Link: http://lucene.apache.org/nutch/
Keywords: Nutch, Lucene
Prerequisites
1. JDK and JRE should be installed, and the environment variables JAVA_HOME
and JRE_HOME should be set.
2. Set the path to the current Ant build, if not done already. Apache Ant is a
Java-based build tool which builds the project using XML-based configuration
files. Its current version (1.7.1) can be downloaded from
www.apache.org/dist/ant/binaries/apache-ant-1.7.1-bin.tar.gz
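The two prerequisite steps above can be sketched as shell exports. The install
locations below are hypothetical; substitute the paths of your own JDK and
extracted Ant archive:

```shell
# Hypothetical install locations -- adjust to your system
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export JRE_HOME=$JAVA_HOME/jre
export ANT_HOME=/home/pcquest/apache-ant-1.7.1
# Put the java and ant binaries on the PATH
export PATH=$JAVA_HOME/bin:$ANT_HOME/bin:$PATH
```

Adding these lines to ~/.bashrc makes the settings survive across logins.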
[Figure: Search result by Nutch]
Installing and configuring Nutch
The latest version of Nutch (ver 0.9) can be downloaded from http://www.apache.org/dist/lucene/nutc.
Assume that the login is pcquest and the home folder is /home/pcquest. Create a
folder in it, named say mySearch, extract the contents therein and then go to
the folder /home/pcquest/mySearch/nutch-0.9/, which is the root folder of
Nutch. Now Nutch has to be configured, which includes two tasks:
1. Configuring the crawl filter: Edit the file conf/crawl-urlfilter.txt and
change - to + at only one place, after the line "# skip everything else", so
that it appears as:
# skip everything else
+.
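Accepting everything with +. lets the crawler wander beyond your own site. A
common alternative sketch, assuming the seed site used later in this article,
is to keep the skip rule and instead add an accept pattern for your domain
earlier in the same file:

```
# accept hosts in the iitkgp.ac.in domain (hypothetical pattern for our seed site)
+^http://([a-z0-9]*\.)*iitkgp.ac.in/

# skip everything else
-.
```

With this variant, pages outside the listed domain are filtered out during the crawl.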
2. Modification to the Nutch configuration: This specifies the folder
containing the crawled data and enables the Nutch Searcher to search the
crawled web data. Initially the file conf/nutch-site.xml does not contain any
configuration details. We have to modify it by adding, between its
<configuration> tags, a property pointing to the target folder that contains
the crawled data.
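A minimal sketch of such an entry, assuming the crawled data will be stored in
/home/pcquest/mySearch/nutch-0.9/myCrawled (the folder created by the crawl
command later in this article):

```xml
<configuration>
  <property>
    <name>searcher.dir</name>
    <!-- folder where the crawled and indexed data resides -->
    <value>/home/pcquest/mySearch/nutch-0.9/myCrawled</value>
  </property>
</configuration>
```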
The file conf/nutch-default.xml should also be modified to include an agent
name between the <value> tags of the http.agent.name property, which is empty
by default.
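The final entry could look like the following sketch; the agent name
pcquest-spider is a hypothetical choice, and any string identifying your
crawler will do:

```xml
<property>
  <name>http.agent.name</name>
  <value>pcquest-spider</value>
  <description>HTTP 'User-Agent' request header sent by the crawler.</description>
</property>
```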
Now Nutch is ready for crawling and indexing your website.
Crawling, indexing and searching the website
Nutch initially crawls and indexes websites and is then ready for serving user
queries by searching the indexed data.
1. Crawling and indexing websites: In the Nutch folder, /home/pcquest/mySearch/
nutch-0.9/, make a directory named urls and in it create a text file named
seed_urls containing the list of URLs, one per line (we used
http://www.iitkgp.ac.in/), and then build the system using the command
"ant && ant war". Now remove ROOT* from the webapps folder of Apache Tomcat,
copy nutch-0.9.war into Tomcat's webapps folder under the name ROOT.war and
restart the Tomcat server. Now the system should perform crawling using the
following command:
$ ./bin/nutch crawl urls/seed_urls -depth 2 -threads 10 -dir myCrawled
It should be ensured that the folder myCrawled does not already exist. The
above command creates the folder named myCrawled and stores the crawled and
indexed data in it; if this folder already exists, the crawler terminates.
The values of the parameters depth and threads are user defined: depth gives
the number of link levels of the websites to be crawled and threads gives the
number of concurrent crawling processes. Once crawling is over, searching
starts with the Nutch user interface.
2. Searching for user queries among indexed documents
The deployment of the search engine can be tested using the address
http://localhost:8080/. This loads the default Nutch user interface (below),
which can be modified to fit into your website.
Instead of using the above default interface, the following code can be used
to include a search box with a submit button in your website; query is the
request parameter expected by Nutch's search.jsp page:

<form action="http://10.5.16.234:8080/search.jsp" method="get">
  <input type="text" name="query" size="30">
  <input type="submit" value="Search">
</form>

Here 10.5.16.234 is the IP address of our computer running Tomcat. The
resulting search box is shown at the bottom.
Conclusion
The major bottleneck of search engines is the relevance of the retrieved
results. Nutch's performance can be tuned either by web developers
implementing their own ranking algorithm or by tuning the boosting parameters
available in the current build of the Nutch system. Nutch also provides
facilities for including various user-defined plugins, and hence cross-lingual
information access with support for multiple languages is possible.
Internally, Nutch uses Hadoop, which is built on the map-reduce model and is
capable of operating in a distributed fashion. Nutch thus stands as a powerful
Open Source library which can be used for solving search-related issues in
many classical real-world problems of machine learning and information
retrieval. Apart from Nutch, users may also try Terrier, another Open Source
search engine with a good ranking strategy.
R. Rajendra Prasath and Sumit Goswami; IIT Kharagpur