
Embedding an Open Source Search Engine

PCQ Bureau

A common feature of websites is an inbuilt search facility for retrieving data of interest to the user. Developers generally incorporate into their websites the customized search APIs of popular search engines like Google, Yahoo!, MS Live and Amazon. These companies crawl the websites concerned and provide search across the documents of those websites as well as the World Wide Web, which may also act as an advertisement for them through those websites. As a matter of pride, many organizations would prefer to have their own search engine embedded in the website.


A decade ago, search engines like AltaVista, Lycos, Yahoo and AskJeeves (now ask.com) were popular. Later, Google, with its sophisticated ranking strategy, ensured acceptable results for different types of user queries. But Google's customized search service is a paid one, and many sites may not be able to pay for availing the facility. Then different search engines like Cuil, Guruji, Khoj and Terrier came up on the web with their own ranking strategies and support for multiple languages. Alongside these developments, Open Source search engines also emerged.

Nutch



Nutch is an Open Source search engine developed in Java on top of Lucene, which is itself a free Open Source information retrieval system. Nutch can be deployed in Internet or Intranet environments and can be customized for building small or large scale information retrieval systems supporting multiple languages.

Direct Hit!

Applies To: Website developers and information retrieval researchers
USP: Developing a search engine for text document retrieval
Primary Link: http://lucene.apache.org/nutch/
Keywords: Nutch, Lucene


Prerequisites



1. Java (JDK) and the JRE should be installed, and the JAVA_HOME and JRE_HOME environment variables should be set.



2. Set the path to the current Ant build, if not already done. Apache Ant is a Java-based build tool that builds a project using XML-based configuration files. Its current version (1.7.1) can be downloaded from

www.apache.org/dist/ant/binaries/apache-ant-1.7.1-bin.tar.gz
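
As a rough guide, the following lines could be added to the shell startup file (for example ~/.bashrc); the install locations shown here are only assumptions and must be replaced with the actual paths on your machine:

export JAVA_HOME=/usr/lib/jvm/jdk1.6.0          # assumed JDK install folder
export JRE_HOME=$JAVA_HOME/jre                  # JRE bundled with the JDK
export ANT_HOME=/opt/apache-ant-1.7.1           # folder where the Ant archive was extracted
export PATH=$PATH:$JAVA_HOME/bin:$ANT_HOME/bin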


(Screenshot: Search result by Nutch)

Installing and configuring Nutch



The latest version of Nutch (ver 0.9) can be downloaded from http://www.apache.org/dist/lucene/nutch/. Assume that the login is pcquest and the home folder is /home/pcquest. Create a folder, named say mySearch, download the file nutch-0.9.tar.gz (size 68 MB) into it, extract the contents there and then go to the folder /home/pcquest/mySearch/nutch-0.9/, which is the root folder of Nutch.
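
These steps, expressed as shell commands (the wget URL follows the download location above and the folder names follow the example; adjust them to your setup):

$ cd /home/pcquest
$ mkdir mySearch && cd mySearch
$ wget http://www.apache.org/dist/lucene/nutch/nutch-0.9.tar.gz
$ tar -xzf nutch-0.9.tar.gz
$ cd nutch-0.9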

Now Nutch has to be configured, which includes two tasks:


1. Configuring the crawl filter: Edit the file conf/crawl-urlfilter.txt and change '-' to '+' at only one place, after the line "# skip everything else", so that it appears as:

# skip everything else
+.
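
The '+.' rule accepts every URL the crawler encounters. If the crawl should stay within your own site, a more restrictive filter can be used instead; a sketch using the example site, in the regular-expression syntax already present in crawl-urlfilter.txt:

# accept URLs within the example domain only
+^http://([a-z0-9]*\.)*iitkgp.ac.in/

# skip everything else
-.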

2. Modifying the Nutch configuration: This tells Nutch the folder containing the crawled data and enables the Nutch searcher to search the crawled web data. Initially, the file conf/nutch-site.xml does not contain any configuration details. We have to modify it by including the target folder which contains the crawled data. Add the following lines between the <configuration> and </configuration> tags:








<property>
  <name>searcher.dir</name>
  <value>/home/pcquest/mySearch/nutch-0.9/myCrawled</value>
  <description>Path to the crawled data of your web site</description>
</property>



The file conf/nutch-default.xml should be modified to include the agent name between the <value></value> tags of the http.agent.name property. We use 'pcquest' as the agent name, and the final entry looks like:


<property>
  <name>http.agent.name</name>
  <value>pcquest</value>
</property>
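
For completeness, conf/nutch-default.xml also carries related agent properties such as http.agent.description, http.agent.url and http.agent.email. Filling these in is optional but helps webmasters identify your crawler; the values below are purely illustrative placeholders:

<property>
  <name>http.agent.description</name>
  <value>PCQuest test crawler</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.example.org/</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>webmaster@example.org</value>
</property>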

Now Nutch is ready for crawling and indexing your website.


Crawling, indexing and searching the website



Nutch initially crawls and indexes the websites and is then ready to serve user queries by searching the indexed data.

1. Crawling and indexing websites: In the Nutch folder /home/pcquest/mySearch/nutch-0.9/, make a directory named urls and, inside it, create a text file named seed_urls containing the list of URLs, one per line (we used http://www.iitkgp.ac.in/). Then build the system using the command "ant && ant war". Next, remove ROOT* from the webapps folder of the Apache Tomcat installation, copy nutch-0.9.war into that webapps folder under the name ROOT.war, and restart the Tomcat server; a shell sketch of these steps follows.
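
These build-and-deploy steps could look roughly like the commands below; the Tomcat location (/usr/local/tomcat) and the location of the generated war file are assumptions that must be adapted to your installation:

$ cd /home/pcquest/mySearch/nutch-0.9
$ mkdir urls
$ echo "http://www.iitkgp.ac.in/" > urls/seed_urls
$ ant && ant war                                  # builds Nutch and the web application
$ rm -rf /usr/local/tomcat/webapps/ROOT*          # assumed Tomcat path
$ cp build/nutch-0.9.war /usr/local/tomcat/webapps/ROOT.war   # war location may vary; check the build folder
$ /usr/local/tomcat/bin/shutdown.sh && /usr/local/tomcat/bin/startup.sh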

Now the system can perform crawling using the following command:

$ ./bin/nutch crawl urls/seed_urls -depth 2 -threads 10 -dir myCrawled


It should be ensured that the folder myCrawled does not already exist. The above command creates the folder myCrawled and stores the crawled and indexed data in it; if the folder already exists, the crawler terminates. The values of the parameters depth and threads are user defined: depth indicates how many link levels deep the websites are to be crawled, and threads is the number of concurrent crawling processes. Once crawling is over, searching can start through the Nutch user interface.
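
Before switching to the web interface, the freshly built index can be sanity-checked from the command line using the NutchBean class bundled with Nutch; it prints the number of hits and the top results for a query (the query word here is only an example):

$ ./bin/nutch org.apache.nutch.searcher.NutchBean kharagpur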

2. Searching for user queries among the indexed documents

The deployment of the search engine can be tested using the address http://localhost:8080/. This loads the default Nutch user interface (below), which can be modified to fit into your website.



Instead of using the default interface above, the following code can be used to include a search box with a submit button in your website:

<form action="http://10.5.16.234:8080/search.jsp" method="get">
<input type="text" name="query" size="40">
<input type="submit" value="Search">
</form>


Here, 10.5.16.234 is the IP address of our computer running Tomcat. The resulting search box is shown at the bottom.
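
Since the form uses method="get", submitting the box simply requests a URL of the form http://10.5.16.234:8080/search.jsp?query=your+terms, so the results are rendered by the same default Nutch results page tested above.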

Conclusion


The major bottleneck for search engines is the relevance of retrieved results. Nutch's performance can be tuned either by web developers implementing a ranking algorithm of their own or by tuning the boosting parameters available in the current build of the Nutch system. Nutch also provides facilities for including various user-defined plugins, and hence cross-lingual information access is possible with support for multiple languages.
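
As an illustration of the second option, the query-basic plugin shipped with Nutch exposes boost parameters (for example query.title.boost, query.anchor.boost, query.url.boost and query.host.boost) in conf/nutch-default.xml; the exact names and sensible values should be verified against your build. A purely illustrative override, placed between the <configuration> tags of conf/nutch-site.xml:

<property>
  <name>query.title.boost</name>
  <value>2.0</value>
  <description>Illustrative boost for matches in the page title</description>
</property>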

Nutch internally uses Hadoop, which is built on MapReduce technology and is capable of operating in a distributed fashion. Nutch thus stands as a powerful Open Source library that can be used for solving search-related issues in many classical real-world problems of machine learning and information retrieval. Apart from Nutch, users may also try Terrier, another Open Source search engine with a good ranking strategy.

R. Rajendra Prasath and Sumit Goswami, IIT Kharagpur

