You cannot stop problems in a network from occurring, but what you can do is detect these problems whenever they occur (and sometimes before they occur) and take steps to resolve them. You could do so in two ways. One way is to set up monitoring software that will periodically poll the hosts and services for abnormal behavior, such as the host not being up or not providing the necessary services correctly. In case of problems the software can send alerts. We will talk about how you can use this method in this article.
In the second way, the monitoring host does not actively check for normalcy. Instead, it waits for other devices, such as switches or routers, to report a problem to it, and then notifies you. The protocol used in this case is SNMP (Simple Network Management Protocol). We talk about this method in detail in the following article.
Because the two methods identify different types of problems, they are not mutually exclusive. You may need to use them simultaneously.
Let's talk about the first method. To monitor our network, we used Nagios, an open-source network-monitoring package, which you can get from this month's PCQXtreme CD. Nagios runs on Linux, but can be used to monitor any machine or service, be it on Windows, Linux, UNIX or any other environment.
As a first step, you need to install Nagios and do its basic configuration.
INSTALLING NAGIOS
Nagios can be installed on a machine with PCQLinux 2004 server installed. First extract the setup files from the Nagios tarball.
#tar -zxvf nagios-1.2.tar.gz
This will create a directory named nagios-1.2 in your current directory. This will be your Nagios setup directory. Next create the installation directory where the Nagios binary and configuration files will be stored.
#mkdir /usr/local/nagios
Now add a system user under which the Nagios process will run.
#adduser nagios
Go to the Nagios setup directory
#cd nagios-1.2
and run the configuration script:
#./configure --prefix=/usr/local/nagios \
--with-cgiurl=/nagios/cgi-bin --with-htmurl=/nagios/ --with-nagios-user=nagios \
--with-nagios-grp=nagios
This will configure the Nagios setup before you start up the compilation process to build the Nagios binaries.
Now compile the Nagios binaries.
#make all
and install the binaries and HTML files to the installation directory.
#make install
Install the init script, which will be used to start Nagios at boot time.
#make install-init
The script is stored as the file /etc/rc.d/init.d/nagios.
Create and configure permissions on the directory for holding the external command file.
#make install-commandmode
Now install the sample configuration files. These files will work as a starting point when you configure Nagios to monitor hosts and services on the network.
#make install-config
These files will be stored in the /usr/local/nagios/etc directory and you'll have to change their default extension from *.cfg-sample to *.cfg.
This is the initial base install of Nagios. The core of Nagios is the Nagios binary and a few configuration files. To do anything useful, Nagios relies on plugins. Plugins are external scripts or executable programs, which the Nagios process uses to monitor the status of various hosts and services. Let's see how to install plugins on the monitoring host to make Nagios functional.
To install the Nagios plugins you first have to get the plugin RPMs, either from http://sourceforge.net/projects/nagiosplug/ or from the PCQuest CD, and install them.
#rpm -ivh nagios-plugins-1.3.1-1.9.i386.rpm
Nagios expects the plugins to be placed in the directory /usr/local/nagios/libexec but the RPM installs the plugins in the directory /usr/lib/nagios/plugins. So, make a directory
#mkdir /usr/local/nagios/libexec
Move the plugins to that directory
#mv /usr/lib/nagios/plugins/* /usr/local/nagios/libexec
Lastly, create two symlinks for openssl files.
#ln -s /lib/libcrypto.so.0.9.6b /lib/libcrypto.so.4
#ln -s /lib/libssl.so.0.9.6b /lib/libssl.so.4
This will install the most common plugins, which check the status of hosts, TCP services, local users, local swap space usage, etc. You can get more plugins from the Nagios plugin page at http://sourceforge.net/projects/nagiosplug/. You can also create your own plugins and use them with Nagios.
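If you do write your own plugin, the contract with Nagios is small: the plugin prints one line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is only a minimal sketch of that contract in Python. The script name check_load_sketch.py, the choice of the one-minute load average as the metric, and the two thresholds are our own assumptions for illustration, not something shipped with Nagios.

```python
"""Minimal sketch of a custom Nagios plugin (hypothetical example)."""
import os

# Exit codes that Nagios interprets as service states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value to a plugin state using two thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def main(argv):
    """Hypothetical usage: check_load_sketch.py <warn> <crit>"""
    try:
        warn, crit = float(argv[1]), float(argv[2])
        load1 = os.getloadavg()[0]  # one-minute load average (Unix only)
    except (IndexError, ValueError, OSError, AttributeError):
        print("LOAD UNKNOWN - usage: check_load_sketch.py <warn> <crit>")
        return UNKNOWN
    state = classify(load1, warn, crit)
    # One line of text for the Web interface and notifications.
    print("LOAD %s - load1=%.2f" % (STATE_NAMES[state], load1))
    return state
```

To use the sketch as a real plugin you would save it as an executable script in the plugin directory, end it with `import sys; sys.exit(main(sys.argv))` so that the state becomes the process exit code, and reference it from a command definition in checkcommands.cfg.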
After the plugins and basic setup, it is time to configure the Web interface for Nagios.
The web interface provides you with a quick snapshot of the status of all monitored hosts and services. With the web interface you can also generate reports on the service, host availability trends and many other things. The Web interface includes a set of static HTML files and a few CGIs that provide you with dynamic content.
Setting up the Web interface
Open up the Apache configuration file /etc/httpd/conf/httpd.conf and append the following lines to it:
ScriptAlias /nagios/cgi-bin/ /usr/local/nagios/sbin/
<Directory "/usr/local/nagios/sbin/">
AllowOverride AuthConfig
Options ExecCGI
Order allow,deny
Allow from all
</Directory>
Alias /nagios/ /usr/local/nagios/share/
<Directory "/usr/local/nagios/share">
Options None
AllowOverride AuthConfig
Order allow,deny
Allow from all
</Directory>
Also add the following lines to provide authentication to the Web interface.
<Directory /usr/local/nagios/sbin>
AllowOverride AuthConfig
order allow,deny
allow from all
Options ExecCGI
</Directory>
<Directory /usr/local/nagios/share>
AllowOverride AuthConfig
order allow,deny
allow from all
</Directory>
Now create a file named .htaccess in both the /usr/local/nagios/share and /usr/local/nagios/sbin directories
#vi /usr/local/nagios/share/.htaccess
#vi /usr/local/nagios/sbin/.htaccess
and add the following lines to both files.
AuthName "Nagios Access"
AuthType Basic
AuthUserFile /usr/local/nagios/etc/htpasswd.users
require valid-user
After this you will have to create the users who can access the Nagios Web interface. This is done by using the htpasswd command supplied with Apache.
#htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
Enter the password on the screen and confirm it.
This will create a username (nagiosadmin) and password that have to be used to access the Nagios Web interface. You can add more users by running the above command with a different username, but leave out the -c option, which creates the file afresh and would overwrite the existing users.
The command creates a file named htpasswd.users in the /usr/local/nagios/etc directory and it stores usernames and encrypted passwords which are then matched with the username and password supplied by the user, when accessing the Nagios interface. For things to work this way the system account apache should have read access to this file, as the apache Web server process works under this account. For that run the following command:
#chmod o+r /usr/local/nagios/etc/htpasswd.users
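If the Web interface keeps rejecting a valid password, it is worth confirming that the permission change actually took effect. The short Python sketch below checks the "other" read bit that chmod o+r sets. The function name is our own, and the check assumes a Unix-style permission model.

```python
import os
import stat


def readable_by_others(path):
    """Return True if the 'other' read bit (o+r) is set on path."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)
```

Calling `readable_by_others("/usr/local/nagios/etc/htpasswd.users")` on the monitoring host should return True once the chmod command above has been run.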
After making all these changes to your system restart the apache Web server.
#/etc/rc.d/init.d/httpd restart
Now that the Nagios binaries are set, the plugins installed and the Web interface configured, it is time to configure the way Nagios and the Web interface will work.
Configuring Nagios and the Web interface
The files used for configuring Nagios and the Web interface are the main configuration file and the CGI configuration file.
The main configuration file is /usr/local/nagios/etc/nagios.cfg and it contains a number of directives that affect how Nagios operates. This config file is read by both the Nagios process and the CGIs. This is the first configuration file that should be modified. The default file is appropriate for most cases but can be modified as per your requirement. One change that we suggest is to enable external commands by changing the value of check_external_commands option from 0 to 1 in the file.
The CGI configuration file, /usr/local/nagios/etc/cgi.cfg, determines how the various Nagios CGIs, which are used when the Nagios Web interface is accessed, will work. As with the main configuration file, the default values are suitable for most requirements, but you may want to make the following changes in it.
authorized_for_system_information=nagiosadmin
authorized_for_configuration_information=nagiosadmin
authorized_for_all_services=nagiosadmin
authorized_for_all_hosts=nagiosadmin
authorized_for_all_service_commands=nagiosadmin
authorized_for_all_host_commands=nagiosadmin
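Each of these directives takes a comma-separated list of usernames, so once you add more users with htpasswd it is easy to lose track of who can do what. As a rough sketch, the Python function below (its name is our own invention) reads the text of cgi.cfg and lists the users behind every authorized_for_ directive.

```python
def authorized_users(cfg_text):
    """Map each authorized_for_* directive to its list of usernames."""
    permissions = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.startswith("authorized_for_"):
            permissions[key] = [u.strip() for u in value.split(",") if u.strip()]
    return permissions
```

You could feed it the file contents with `authorized_users(open("/usr/local/nagios/etc/cgi.cfg").read())` and print the result.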
DEFINING HOSTS, SERVICES AND CONTACTS
Object configuration files are used to define the hosts, services and hostgroups that Nagios will monitor, as well as contacts, contactgroups, plugin commands, etc. This is where you define what you want to monitor, how you want to monitor it and whom to notify. The object configuration files are hosts.cfg, services.cfg, hostgroups.cfg, contacts.cfg, contactgroups.cfg, checkcommands.cfg, misccommands.cfg, timeperiods.cfg, escalations.cfg and dependencies.cfg, all found in the /usr/local/nagios/etc directory. Let's see how to configure each of these files. We will configure two hosts, a Windows 2003 machine and a PCQLinux 8 machine.
Open the hosts.cfg file. By default it contains several host definitions, which you can comment out if not required. The file contains a generic host definition template which should be used in the configuration so do not comment it out. Add the following definitions.
define host{
use generic-host
host_name windows2k3
alias Win Server #1
address 192.168.3.11
check_command check-host-alive
max_check_attempts 10
notification_interval 120
notification_period 24x7
notification_options d,u,r
}
define host{
use generic-host
host_name pcqlinux8
alias Linux Server #1
address 192.168.3.13
check_command check-host-alive
max_check_attempts 10
notification_interval 480
notification_period 24x7
notification_options d,u,r
}
The definitions are very simple. You have to define the host name, the template to use, the IP address of the host, the number of check attempts before a notification is sent out and the time interval in minutes for re-notification. The notification options "d,u,r" define the states for which notifications are sent out: down, unreachable and recovered, respectively. The check command option tells Nagios which plugin command to use to check the status of the host.
A host group definition is used to group one or more hosts together for the purposes of simplifying notifications. Each host that you define must be a member of at least one host group, even if it is the only host in that group. Hosts can be in more than one host group. So add the following lines to the hostgroups.cfg file.
define hostgroup{
hostgroup_name windows-servers
alias Windows Servers
contact_groups nt-admins
members windows2k3
}
define hostgroup{
hostgroup_name linux-servers
alias Linux Servers
contact_groups linux-admins
members pcqlinux8
}
The definitions of hostgroups are self-explanatory. The contact_groups option defines which contact groups will be notified of the status of hosts in that particular hostgroup.
A service definition is used to identify a service that runs on a host, which you'll want to monitor from Nagios. The term service, as used here, can mean an actual service that runs on the host (POP, SMTP, HTTP, etc.) or some other type of metric associated with the host (response to a ping, number of logged-in users, free disk space, etc.). Open the file services.cfg, and comment out the various service definitions in it except for the generic service template. Now put definitions in this file for all the services you want to monitor. Below are a few example definitions for our two hosts.
define service{
use generic-service
host_name windows2k3
service_description PING
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups nt-admins
notification_interval 120
notification_period 24x7
notification_options c,r
check_command check_ping!100.0,20%!500.0,60%
}
define service{
use generic-service
host_name pcqlinux8
service_description HTTP
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 2
retry_check_interval 1
contact_groups linux-admins
notification_interval 240
notification_period 24x7
notification_options w,u,c,r
check_command check_http
}
These definitions are similar to the host definitions, with a few differences. The normal_check_interval option sets the number of minutes between checks while the last status was OK. The retry_check_interval option sets the number of minutes before the service is rechecked when the last status was non-OK. The notification options w,u,c,r stand for warning, unknown, critical and recovered, respectively. More definitions along the same lines can be added to the file. Use the commands stored in the plugin directory for the check command option to monitor other services.
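To see how normal_check_interval, retry_check_interval and max_check_attempts play together, it helps to walk through a sequence of check results. The Python sketch below is a deliberately simplified model of our own, not Nagios code: it treats every non-OK result alike, sends one notification when the problem has been seen max_check_attempts times in a row, sends one more on recovery, and ignores re-notification through notification_interval.

```python
def simulate(results, normal_interval=5, retry_interval=1, max_attempts=3):
    """Return (minute, result, notify) for each check in a simplified model.

    results: check outcomes in order, "OK" or any non-OK string.
    """
    timeline = []
    minute = 0
    failures = 0          # consecutive non-OK results so far
    hard_problem = False  # True once the problem has been confirmed
    for result in results:
        notify = False
        if result == "OK":
            if hard_problem:
                notify = True  # recovery notification (the "r" option)
            failures = 0
            hard_problem = False
            wait = normal_interval
        else:
            failures += 1
            if not hard_problem and failures >= max_attempts:
                hard_problem = True
                notify = True  # problem confirmed on the last retry
            # Retry quickly while unconfirmed, then fall back to normal pace.
            wait = normal_interval if hard_problem else retry_interval
        timeline.append((minute, result, notify))
        minute += wait
    return timeline
```

With the PING service settings above (5, 1 and 3), an OK check followed by a run of failures gives checks at minutes 0, 5, 6 and 7, and the first notification goes out at minute 7, two minutes after the first failed check.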
A contact definition is used to identify someone who should be contacted in the event of a problem on your network.
define contact{
contact_name nagiosad
alias Admin
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
service_notification_commands notify-by-email,notify-by-epager
host_notification_commands host-notify-by-email,host-notify-by-epager
email admin@cmil.com
pager 98xxxxxx
}
The e-mail option defines the e-mail address where notifications for that contact are to be sent. The pager option can be used to send SMS alerts (read about it in the following story).
A contact group definition is used to group one or more contacts together for the purpose of sending out alert/recovery notifications. When a host or service has a problem or recovers, Nagios will find the appropriate contact groups to send notifications to, and notify all contacts in those contact groups.
To define contact groups:
define contactgroup{
contactgroup_name nt-admins
alias Administrators
members nagiosad
}
define contactgroup{
contactgroup_name linux-admins
alias Linux Admin
members nagiosad
}
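Because hosts, hostgroups, services, contacts and contact groups refer to each other by name across several files, a single mistyped name (say, a contact group written as nt-admin when it is defined as nt-admins) is enough to break the configuration. Nagios does its own, far more thorough, verification when run with the -v option. The Python sketch below only illustrates the idea of the cross-check. The function names are our own, and the parser assumes the simple one-directive-per-line layout used in this article.

```python
import re

# Matches "define <type>{ ... }" blocks; bodies are assumed not to contain "}".
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.S)


def parse_objects(text):
    """Return a list of (object_type, {directive: value}) pairs."""
    # Drop whole-line '#' comments first so commented-out definitions are skipped.
    lines = [l for l in text.splitlines() if not l.lstrip().startswith("#")]
    objects = []
    for obj_type, body in DEFINE_RE.findall("\n".join(lines)):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # strip trailing ';' comments
            if line:
                parts = line.split(None, 1)
                directives[parts[0]] = parts[1] if len(parts) > 1 else ""
        objects.append((obj_type, directives))
    return objects


def defined_names(objects, obj_type, key):
    """Collect the names declared by all objects of one type."""
    return {d[key] for t, d in objects if t == obj_type and key in d}


def split_list(value):
    """Split a comma-separated directive value into clean names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def find_dangling(objects):
    """Return (object_type, directive, name) for every undefined reference."""
    known = {
        "host": defined_names(objects, "host", "host_name"),
        "contact": defined_names(objects, "contact", "contact_name"),
        "contactgroup": defined_names(objects, "contactgroup", "contactgroup_name"),
    }
    # (object type, directive, kind of object the directive must point at)
    references = [
        ("service", "host_name", "host"),
        ("service", "contact_groups", "contactgroup"),
        ("hostgroup", "contact_groups", "contactgroup"),
        ("hostgroup", "members", "host"),
        ("contactgroup", "members", "contact"),
    ]
    problems = []
    for obj_type, directives in objects:
        for ref_type, key, target in references:
            if obj_type != ref_type:
                continue
            for name in split_list(directives.get(key, "")):
                if name not in known[target]:
                    problems.append((obj_type, key, name))
    return problems
```

To try it, concatenate the contents of hosts.cfg, hostgroups.cfg, services.cfg, contacts.cfg and contactgroups.cfg into one string and pass the result of parse_objects() to find_dangling(). An empty list means every name referenced in those files is defined somewhere.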
Unless there is a compelling need you do not need to make changes to other files, as the default settings will work fine. But if you do want to change them, then the files themselves contain detailed information about the various options.
RUNNING NAGIOS
After configuring your files, you are all set to start the Nagios process. Before doing that, however, open the file dependencies.cfg and comment out all lines in it. Dependencies are an advanced feature of Nagios that allows you to suppress notifications for hosts based on the status of one or more other hosts. You don't need that at this moment, and leaving the sample lines in place will prevent Nagios from running.
To start the Nagios process issue the following command.
#service nagios start
To make Nagios start automatically at system boot, issue the following command.
#chkconfig nagios on
Now the Nagios process is up and running. When any host or service defined to be monitored by Nagios goes down or a service doesn't work properly you will be notified by e-mail and/or by SMS if you have configured Nagios to send SMS alerts as well.
In case you want to know the status of your monitored hosts and services at any particular point in time then you can use the web interface of Nagios. The interface can also be accessed by a WAP phone.
To access Nagios' Web interface, open the URL http:// in a browser and supply the username and password you created, and you will be logged on to the Nagios interface. From here you can view details about the hosts and services defined in the configuration files, generate reports to look for trends in host and service availability, review the configuration to check that everything is defined properly, and so on.
IBM TIVOLI, HP OPENVIEW, CA UNICENTER: THE BIG DADDIES
For really large and complex networks you need products of a similar class and capability to manage them. The big daddies of this game are Tivoli from IBM, OpenView from HP and Unicenter from CA. These are not single products, but a large array of components that you choose from. They go beyond simple network management to integrate enterprise-wide IT and IT asset-management functions such as storage management, application management, help-desk management and service-level monitoring and management. Needless to say, these are complex beings and need separate infrastructure and staff to be dedicated to them.
IBM Tivoli
IBM Tivoli is system-management software that monitors your network at the network component and business system-application levels. It identifies critical problems as well as misleading symptoms and effects, and then either notifies support staff with the appropriate response or automatically cures the problem. It has modules that discover TCP/IP and SNA networks, display network topologies, correlate and manage events and SNMP traps, monitor network health and gather performance data. Tivoli can be used to monitor not just network resources, but also applications and storage resources. Other parts of Tivoli can distribute software throughout the enterprise, manage changes, let you control your IT assets, automate workflow through the enterprise and remotely control systems and applications. It also lets you maintain a device and application inventory for asset management in your organization.
HP OpenView
HP OpenView is network-management software that can manage the availability and performance of both voice and data networks. It discovers and maps the relationship of network devices to each other.
It also measures the performance of your network, providing insights into network utilization and possible bottlenecks. The network node manager details the network topology for switched networks. The solution can be used for IP, ATM and telecom networks. It provides problem detection with appropriate alarms and detailed statistics about the problem. The problem is also mapped on the network topology diagram. OpenView also works with MPLS VPNs and WANs.
CA Unicenter
Computer Associates Unicenter is a suite of products that provides database management, enterprise job management, operation management, IT resource management and service management. The operation-management part manages the health and availability of the infrastructure. It monitors the performance of applications and the network, and provides optimization for mission-critical applications, Web services, databases and system management. It lets you proactively manage and optimize LAN, WAN, switched and VLAN networks. It has support for TCP/IP and SNA networks, OpenVMS and mainframe systems.
The IT resource management part provides asset-tracking capabilities through automated discovery, hardware inventory, network inventory, software inventory, configuration management, software usage monitoring, license management and extensive cross-platform reporting.