Intranet Search Engine

PCQ Bureau
The custom installation of PCQ Linux 7.1 includes an intranet-class search engine, htdig, installed by default. The version supplied is 3.2 beta; the stable version is 3.1.5, which you can download from www.htdig.org/. To check whether it is pre-installed, log in as root and change to the directory /var/www/html. Then do a directory listing with:

ls -l htdig

If the file ‘htdig’ exists, you will see it listed along with its symbolic link to /usr/share/htdig. Next, change directory to /var/www/cgi-bin and issue:

ls -l htsearch

If the file ‘htsearch’ is listed, the search engine executable is installed. Use the ‘find’ or ‘which’ command to locate the file named ‘rundig’:

which rundig

With this you’ll be sure that the search engine is pre-installed and ready for use. Next, change directory to /usr/share/htdig.

This is where the HTML files reside that generate the Web page for entering a search string and the resulting output pages. You can customize these pages to reflect your company’s design. Last, change directory to /etc and do ‘ls -l htdig.conf’ to check for the ‘htdig’ configuration file.
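The checks above can be gathered into one small script. This is only a sketch: it assumes the paths named in this article, and the function names check_path and check_htdig_install, as well as the optional root-prefix argument, are made up here for illustration.

```shell
#!/bin/sh
# Sketch: report which of the ht://Dig files named in this article exist.
# Paths are the ones the article uses; adjust them for your own system.

check_path() {
    # Print "found" or "missing" for one path.
    # -e follows symlinks; -L also catches a symlink whose target is gone.
    if [ -e "$1" ] || [ -L "$1" ]; then
        echo "found    $1"
    else
        echo "missing  $1"
        return 1
    fi
}

check_htdig_install() {
    # Optional $1: a root prefix (hypothetical, only useful for dry runs).
    root="${1:-}"
    rc=0
    for p in /var/www/html/htdig /var/www/cgi-bin/htsearch \
             /usr/share/htdig /etc/htdig.conf; do
        check_path "$root$p" || rc=1
    done
    return $rc
}

check_htdig_install || echo "Some ht://Dig files are missing; see the list above."

# 'rundig' is looked up on the PATH, as 'which rundig' would do.
if command -v rundig >/dev/null 2>&1; then
    echo "found    $(command -v rundig)"
else
    echo "missing  rundig (not on PATH; try: find / -name rundig)"
fi
```

Run it as root, as with the manual checks. A "missing" line only tells you where to look; your installation may simply keep the file elsewhere.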

Now you’re ready to configure the search engine. The configuration file is the same whether you use the pre-installed copy or your own installation, though in the latter case it may be named ‘sample.conf’. Open it in a text editor such as vi or joe. The first setting, ‘database_dir: /var/lib/htdig’, is where the index database and index file will reside. Scroll until you find the line:

start_url: http://localhost

and change it to reflect your URL, such as http://www.mydomain.com/ if you have DNS, or enter the IP address of the Web server, like http://192.68.1.100/. You can specify multiple domains or addresses to index on the same line, separated by spaces, for example:

start_url: http://192.68.1.100/ http://192.72.20.100/

The next line, ‘limit_urls_to: ${start_url}’, works with start_url and exclude_urls to ensure that the Web spider does not go on indexing pages forever. My pages, for example, carry hyperlinks to my company’s and other important Internet sites, and I have no connection to the Internet at present; this setting ensures that the spider doesn’t spend time trying to reach sites that I don’t wish to index and that are not accessible. This line does not require any change. The next attribute is ‘maintainer’. Enter your e-mail address here (for example, webmaster@mydomain.com); it will be recorded in the access log of the Web server being indexed. Save and exit the file. You can experiment with the other settings after checking the quality of the output pages.
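If you would rather script these changes than make them in an editor, something like the following may do. It is a minimal sketch, not part of htdig: the set_attr function and the working copy ./htdig.conf.new are names invented for this example, and it assumes each attribute sits on a single ‘name: value’ line (values continued over several lines with a backslash are not handled).

```shell
#!/bin/sh
# Sketch: set one "name: value" attribute in an htdig-style config file.

set_attr() {
    # $1 = config file, $2 = attribute name, $3 = new value
    file=$1
    name=$2
    value=$3
    new="$file.new.$$"
    awk -v name="$name" -v value="$value" '
        # Rewrite the first line that begins with "name:"; copy all others.
        !seen && index($0, name ":") == 1 { print name ": " value; seen = 1; next }
        { print }
        # If the attribute was never found, append it at the end.
        END { if (!seen) print name ": " value }
    ' "$file" > "$new" && mv "$new" "$file"
}

# Work on a copy so that /etc/htdig.conf itself stays untouched.
if [ -f /etc/htdig.conf ]; then
    cp /etc/htdig.conf ./htdig.conf.new
    set_attr ./htdig.conf.new start_url "http://192.68.1.100/ http://192.72.20.100/"
    set_attr ./htdig.conf.new maintainer "webmaster@mydomain.com"
    grep -E '^(start_url|limit_urls_to|maintainer):' ./htdig.conf.new
fi
```

Compare the copy with the original before putting it in place; the limit_urls_to line is deliberately left alone, as described above.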

Now edit the file ‘rundig’ to point to the correct location of the configuration file, database directory, etc. Execute the command ‘rundig’ and it will read the configuration file and start indexing the URLs listed in it. Depending on the number of servers, the available bandwidth, and the amount of data to index, this may take anywhere from several minutes to hours. You can pass parameters to ‘rundig’ to make it verbose, such as ‘rundig -v’ or ‘rundig -vvv’, the latter giving the maximum information as it indexes the various pages or sites. While the indexing is in progress, you can provide a link to the search engine on your home page as follows:

Search.
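Since an indexing run can take hours, it may help to keep the verbose output in a file and note how long the run took. The wrapper below is a rough sketch under that assumption; run_index and the log path /tmp/rundig.log are hypothetical names, and only the ‘rundig -v’ invocation comes from the steps above.

```shell
#!/bin/sh
# Sketch: run the indexer verbosely, save its output, and report the duration.

run_index() {
    # $1 = log file; remaining arguments = command to run (default: rundig -v)
    log=$1
    shift
    [ $# -gt 0 ] || set -- rundig -v
    start=$(date +%s)
    # Capture both stdout and stderr; remember the exit status either way.
    "$@" > "$log" 2>&1 && rc=0 || rc=$?
    end=$(date +%s)
    echo "indexing exited with status $rc after $((end - start)) seconds; log in $log"
    return $rc
}

# Only attempt a real run when rundig is actually installed.
if command -v rundig >/dev/null 2>&1; then
    run_index /tmp/rundig.log rundig -v || echo "rundig reported a problem; check the log."
fi
```

For the most detailed output, pass ‘rundig -vvv’ in place of ‘rundig -v’; expect a much larger log file.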

This is it. Your search engine is up and running.

Jahaj