
Monitor And Alert For Anomalies

PCQ Bureau

You cannot stop problems in a network from occurring, but what you can do is detect these problems whenever they occur (and sometimes before they occur) and take steps to resolve them. You could do so in two ways. One way is to set up monitoring software that will periodically poll the hosts and services for abnormal behavior, such as the host not being up or not providing the necessary services correctly. In case of problems the software can send alerts. We will talk about how you can use this method in this article. 


In the second approach, the monitoring host does not actively check for normalcy. Instead, it waits for other devices, such as switches or routers, to report a problem to it, and then notifies you. The protocol used in this case is SNMP (Simple Network Management Protocol). We talk about this method in detail in the following article.

Because the two methods catch different kinds of problems, they are not mutually exclusive. You may need to use both at the same time.

Let's talk about the first method. To monitor our network, we used Nagios, an open-source network-monitoring package, which you can get from this month's PCQXtreme CD. Nagios runs on Linux, but can be used to monitor any machine or service, be it on Windows, Linux, UNIX or any other environment.


As a first step, you need to install and do the basic configuration in Nagios.

INSTALLING NAGIOS 



Nagios can be installed on a machine with PCQlinux 2004 server installed. First extract the setup files from the Nagios tarball.

#tar -zxvf nagios-1.2.tar.gz


This will create a directory named nagios-1.2 in your current directory. This will be your Nagios setup directory. Next create the installation directory where the Nagios binary and configuration files will be stored.

#mkdir /usr/local/nagios

Now add a system user under which the Nagios process will run.


#adduser nagios

Go to the Nagios setup directory

#cd nagios-1.2


and run the configuration script:

#./configure --prefix=/usr/local/nagios --with-cgiurl=/nagios/cgi-bin --with-htmurl=/nagios/ --with-nagios-user=nagios --with-nagios-grp=nagios

WATCH OUT FOR, AND HOW TO FIX IT

Watch out for: The e-mail server being down, if you have configured alerts to be sent through e-mail.
Problem: You will not receive alerts.
How to fix it: Configure alerts to be sent by SMS as well, by connecting a GSM phone to your monitoring console.

Watch out for: The monitoring console and other hosts being down at the same time.
Problem: You will not receive any alerts.
How to fix it: Create a fault-tolerant setup with one standby monitoring machine that will take over if the primary machine fails.

Watch out for: DNS being down, if you have specified the FQDN, instead of the IP address, of remote machines.
Problem: DNS names will not be resolved to IP addresses.
How to fix it: Use IP addresses while defining critical hosts to be monitored. Hosts should have fixed IP addresses instead of having them assigned by a DHCP server.


This will configure the Nagios setup before you start up the compilation process to build the Nagios binaries.



Now compile the Nagios binaries.

#make all

and install the binaries and HTML files to the installation directory.


#make install

Install the init script, which will be used to start Nagios at boot time.

#make install-init

CIM protocol
The CIM (Common Information Model) is a DMTF (Distributed Management Task Force) backed initiative that aims to standardize the message formats used to describe management data across vendors as well as different 'systems, applications, networks and services'. Using XML as its backbone, it allows vendors to extend the standard as per the requirement, as long as the extensions conform to broad guidelines. These guidelines are defined in the CIM schema, which details the complete data model of the specification. 


The other part of the standard is the CIM specification, which contains guidelines for integration with other data models. CIM has played a very important part in accelerating the growth and adoption of various network management implementations available today, as it has given a common ground to various vendors and made interoperability between various standards possible.

The init script is stored as the file /etc/rc.d/init.d/nagios.



Create and configure permissions on the directory for holding the external command file.

#make install-commandmode

Now install the sample configuration files. These files will work as a starting point when you configure Nagios to monitor hosts and services on the network.

#make install-config

These files will be stored in the /usr/local/nagios/etc directory and you'll have to change their default extension from *.cfg-sample to *.cfg.

This is the initial base install of Nagios. The core of Nagios is the Nagios binary and a few configuration files. To do anything useful, Nagios relies on plugins. Plugins are external scripts or executable programs that the Nagios process uses to monitor the status of various hosts and services. Let's see how to install plugins on the monitoring host to make Nagios functional.

To install the Nagios plugins, you first have to get the plugin RPMs, either from http://sourceforge.net/projects/nagiosplug/ or from the PCQuest CD, and install them.

#rpm -ivh nagios-plugins-1.3.1-1.9.i386.rpm

Nagios expects the plugins to be placed in the directory /usr/local/nagios/libexec but the RPM installs the plugins in the directory /usr/lib/nagios/plugins. So, make a directory

#mkdir /usr/local/nagios/libexec

Move the plugins to that directory

#mv /usr/lib/nagios/plugins/* /usr/local/nagios/libexec

Lastly, create two symlinks for openssl files.

#ln -s /lib/libcrypto.so.0.9.6b /lib/libcrypto.so.4



#ln -s /lib/libssl.so.0.9.6b /lib/libssl.so.4

IPMI protocol

The IPMI (Intelligent Platform Management Interface) specification offers different vendors a standard way of monitoring all the mechanical components of a computer. This spares each vendor the need to devise its own monitoring technique, which would result in virtually no interoperability.


At the heart of any IPMI setup is a small, dedicated processor on the motherboard called the BMC (Baseboard Management Controller) that is used to monitor the hardware. It can communicate with the main processor and other hardware elements, collecting their temperature and voltages in addition to checking to see if the fans and the power supplies are working or not. Since a separate processor is handling all this, there is little performance impact on the system and it can continue to function even when the main processor goes down. Also supported is logging of all the data collected and raising alerts under specified conditions. Since all participating vendors use the same standard, it is possible to have cross vendor interoperability and management.


IPMI v2.0 adds many new capabilities to the standard. It offers enhanced security thanks to better authentication methods, and it also incorporates encryption. SOL (Serial Over LAN) is available as well, allowing the serial controllers to be managed remotely over the LAN. Support for VLANs has also been incorporated, so that sensitive data stays confined to the 'management' VLAN instead of flowing all over the network.

The plugin RPM installs the most common plugins, which check the status of hosts, TCP services, local users, local swap space usage, etc. You can get more plugins from the Nagios plugin page at http://sourceforge.net/projects/nagiosplug/. You can also create your own plugins and use them with Nagios.
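As a rough illustration of what a home-grown plugin can look like, the following Python sketch tests whether a TCP port accepts connections and maps the result to the exit codes that Nagios plugins conventionally return (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The names check_tcp and main and the threshold defaults are made up for this example; this is not one of the plugins shipped in the RPM.

```python
import socket
import time

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, warn_secs=1.0, crit_secs=3.0, timeout=5.0):
    """Return (exit_code, status_line) for a TCP connect test.

    Hypothetical helper for illustration; thresholds are in seconds.
    """
    start = time.monotonic()
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
        conn.close()
    except OSError as exc:
        return CRITICAL, "TCP CRITICAL - cannot connect to %s:%d (%s)" % (host, port, exc)
    elapsed = time.monotonic() - start
    if elapsed >= crit_secs:
        return CRITICAL, "TCP CRITICAL - %.3fs response time" % elapsed
    if elapsed >= warn_secs:
        return WARNING, "TCP WARNING - %.3fs response time" % elapsed
    return OK, "TCP OK - %.3fs response time" % elapsed


def main(argv):
    """Entry point sketch: argv is [program, host, port]."""
    if len(argv) != 3:
        print("TCP UNKNOWN - usage: %s <host> <port>" % argv[0])
        return UNKNOWN
    code, line = check_tcp(argv[1], int(argv[2]))
    print(line)   # the printed line becomes the status text
    return code   # the exit code becomes the check result
```

To use something like this as a plugin, the script would end by calling sys.exit(main(sys.argv)), be placed in /usr/local/nagios/libexec and be referenced from a command definition.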

After the plugins and basic setup, it is time to configure the web interface for Nagios.

The web interface provides you with a quick snapshot of the status of all monitored hosts and services. With it you can also generate reports on service and host availability trends and many other things. The Web interface consists of a set of static HTML files and a few CGIs that provide dynamic content.

Setting up the Web interface



Open up the Apache configuration file /etc/httpd/conf/httpd.conf and append the following lines to it:

ScriptAlias /nagios/cgi-bin/ /usr/local/nagios/sbin/
<Directory "/usr/local/nagios/sbin/">
AllowOverride AuthConfig
Options ExecCGI
Order allow,deny
Allow from all
</Directory>

Alias /nagios/ /usr/local/nagios/share/
<Directory "/usr/local/nagios/share">
Options None
AllowOverride AuthConfig
Order allow,deny
Allow from all
</Directory>

Also add the following lines to provide authentication to the Web interface.

<Directory /usr/local/nagios/share>
AllowOverride AuthConfig
order allow,deny
allow from all
</Directory>

<Directory /usr/local/nagios/sbin>
Options ExecCGI
AllowOverride AuthConfig
order allow,deny
allow from all
</Directory>

Now create a file named .htaccess in both /usr/local/nagios/share and /usr/local/nagios/sbin

#vi /usr/local/nagios/share/.htaccess

#vi /usr/local/nagios/sbin/.htaccess

and add the following lines to the files.

AuthName "Nagios Access"

AuthType Basic

AuthUserFile /usr/local/nagios/etc/htpasswd.users

require valid-user

After this you will have to create the users who can access the Nagios Web interface. This is done by using the htpasswd command supplied with Apache.

#htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin 

Enter the password on the screen and confirm it.

This will create a username (nagiosadmin) and password that has to be used to access the Nagios Web interface. You can add more users by running the command again with a different username, but leave out the -c option: it creates a new password file and would overwrite the existing one.

The command creates a file named htpasswd.users in the /usr/local/nagios/etc directory and it stores usernames and encrypted passwords which are then matched with the username and password supplied by the user, when accessing the Nagios interface. For things to work this way the system account apache should have read access to this file, as the apache Web server process works under this account. For that run the following command:

#chmod o+r /usr/local/nagios/etc/htpasswd.users 

After making all these changes to your system restart the apache Web server.

#/etc/rc.d/init.d/httpd restart

Now that the Nagios binaries are in place, the plugins installed and the web interface configured, it is time to configure the way Nagios and the Web interface will work.

Configuring Nagios and the Web interface



The files used for configuring Nagios and the Web interface are the main configuration file and the CGI configuration file.

The main configuration file is /usr/local/nagios/etc/nagios.cfg and it contains a number of directives that affect how Nagios operates. This config file is read by both the Nagios process and the CGIs. This is the first configuration file that should be modified. The default file is appropriate for most cases but can be modified as per your requirement. One change that we suggest is to enable external commands by changing the value of check_external_commands option from 0 to 1 in the file.

The CGI configuration file, /usr/local/nagios/etc/cgi.cfg, determines how the various Nagios CGIs work; these are used when the Nagios Web interface is accessed. As with the main configuration file, the default values are suitable for most requirements, but you may want to make the following changes to it.

authorized_for_system_information=nagiosadmin

authorized_for_configuration_information=nagiosadmin

authorized_for_all_services=nagiosadmin

authorized_for_all_hosts=nagiosadmin

authorized_for_all_service_commands=nagiosadmin

authorized_for_all_host_commands=nagiosadmin
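If you would rather script these edits to nagios.cfg and cgi.cfg than make them by hand, a small helper along the following lines could do it. This is only a sketch: set_directive is a hypothetical function, it assumes the simple key=value layout of the two files, and it works on the file contents as a string so that you can inspect the result before writing it back.

```python
def set_directive(cfg_text, key, value):
    """Return cfg_text with key=value set.

    Every active (uncommented) line for the key is replaced; if there is
    none, a new line is appended. Commented-out lines are left untouched.
    """
    new_line = "%s=%s" % (key, value)
    out, found = [], False
    for line in cfg_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if "=" in line and name == key:
            out.append(new_line)
            found = True
        else:
            out.append(line)
    if not found:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Example with a made-up two-line excerpt of nagios.cfg.
sample = "# external command option\ncheck_external_commands=0\n"
print(set_directive(sample, "check_external_commands", "1"))
```

The same call with a key such as authorized_for_all_hosts and the value nagiosadmin would cover the cgi.cfg changes listed above.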

DEFINING HOSTS, SERVICES AND CONTACTS



Object configuration files are used to define the hosts, services and hostgroups that Nagios will monitor, as well as contacts, contactgroups, plugin commands, etc. This is where you define what you want to monitor, how you want to monitor it and whom to notify. The various object configuration files are hosts.cfg, services.cfg, hostgroups.cfg, contacts.cfg, contactgroups.cfg, checkcommands.cfg, misccommands.cfg, timeperiods.cfg, escalations.cfg and dependencies.cfg, all found in the /usr/local/nagios/etc directory. Let's see how to configure each of these files. We will configure two hosts, a Windows 2003 machine and a PCQLinux 8 machine.

Open the hosts.cfg file. By default it contains several host definitions, which you can comment out if not required. The file contains a generic host definition template which should be used in the configuration so do not comment it out. Add the following definitions.

define host{
use generic-host
host_name windows2k3
alias Win Server #1
address 192.168.3.11
check_command check-host-alive
max_check_attempts 10
notification_interval 120
notification_period 24x7
notification_options d,u,r
}


define host{
use generic-host
host_name pcqlinux8
alias Linux Server #1
address 192.168.3.13
check_command check-host-alive
max_check_attempts 10
notification_interval 480
notification_period 24x7
notification_options d,u,r
}












The definitions are simple: you define the host name, the template to use, the IP address of the host, the number of check attempts made before a notification is sent out and the time interval in minutes for re-notification. The notification options "d,u,r" define the states for which notifications are sent out: down, unreachable and recovered, respectively. The check command option specifies which plugin is used to check the status of the host.
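To make the effect of max_check_attempts and notification_interval concrete, here is a deliberately simplified Python model. It is not how Nagios schedules checks internally; the two function names are hypothetical, and the model only assumes that a notification goes out once the host has failed max_check_attempts checks in a row and is then repeated every notification_interval minutes while the host stays down. An interval of 0 is treated here as "notify once".

```python
def first_notification_attempt(results, max_check_attempts):
    """results: list of booleans, True meaning the host answered the check.

    Return the 1-based index of the check at which a DOWN notification
    would first be sent, or None if the host never fails
    max_check_attempts checks in a row.
    """
    failures = 0
    for i, up in enumerate(results, start=1):
        failures = 0 if up else failures + 1
        if failures == max_check_attempts:
            return i
    return None


def renotification_minutes(outage_minutes, notification_interval):
    """Minutes after the first notification at which notifications go out
    while the host stays down for outage_minutes."""
    if notification_interval <= 0:
        return [0]  # assumption in this sketch: notify once only
    return list(range(0, outage_minutes + 1, notification_interval))


# A host that fails three checks in a row, with max_check_attempts 3.
print(first_notification_attempt([True, False, False, False], 3))
# A five-hour outage with notification_interval 120 (the windows2k3 host).
print(renotification_minutes(300, 120))
```

With the values from the article, the Windows host would be re-notified every two hours and the Linux host every eight hours for as long as the problem lasts.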

A host group definition is used to group one or more hosts together for the purposes of simplifying notifications. Each host that you define must be a member of at least one host group, even if it is the only host in that group. Hosts can be in more than one host group. So add the following lines to the hostgroups.cfg file.

define hostgroup{

hostgroup_name windows-servers



alias Windows Servers


contact_groups nt-admins


members windows2k3


}


define hostgroup{



hostgroup_name linux-servers



alias Linux Servers


contact_groups linux-admins


members pcqlinux8


}


The definitions of hostgroups are self-explanatory. The contact_groups option defines which contact groups will be notified of the status of hosts in that particular hostgroup.
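Because hosts and hostgroups live in separate files, it is easy to list a member that has no host definition. Nagios can verify its own configuration (run the nagios binary with the -v option followed by the path to nagios.cfg), but a small script can catch the same kind of mistake earlier. The sketch below is an illustration only: parse_objects and undefined_members are hypothetical helpers, and the parser handles just the plain "define type{ ... }" layout used in this article, not inline ; comments or other features of the real format.

```python
import re

# Matches "define <type>{ ... }" blocks in the simple layout used above.
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def parse_objects(cfg_text):
    """Return a list of (object_type, directives_dict) tuples."""
    objects = []
    for obj_type, body in DEFINE_RE.findall(cfg_text):
        directives = {}
        for line in body.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)  # directive name, then the rest
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        objects.append((obj_type, directives))
    return objects


def undefined_members(objects):
    """Return (hostgroup, member) pairs whose member has no host definition."""
    hosts = {d["host_name"] for t, d in objects
             if t == "host" and "host_name" in d}
    missing = []
    for t, d in objects:
        if t != "hostgroup":
            continue
        for member in d.get("members", "").split(","):
            member = member.strip()
            if member and member not in hosts:
                missing.append((d.get("hostgroup_name", "?"), member))
    return missing
```

Feeding it the combined text of hosts.cfg and hostgroups.cfg and getting back an empty list would suggest that every hostgroup member is a defined host.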

A service definition is used to identify a service that runs on a host and that you want to monitor from Nagios. The term service, as used here, can mean an actual service that runs on the host (POP, SMTP, HTTP, etc.) or some other type of metric associated with the host (response to a ping, number of logged-in users, free disk space, etc.). Open the file services.cfg, and comment out the various service definitions in it except for the generic service template. Now put definitions in this file for all the services you want to monitor. Below are a few example definitions for our two hosts.

define service{
use generic-service
host_name windows2k3
service_description PING
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups nt-admins
notification_interval 120
notification_period 24x7
notification_options c,r
check_command check_ping!100.0,20%!500.0,60%
}








define service{
use generic-service
host_name pcqlinux8
service_description HTTP
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 2
retry_check_interval 1
contact_groups linux-admins
notification_interval 240
notification_period 24x7
notification_options w,u,c,r
check_command check_http
}












These definitions are similar to the host definitions, with a few differences. The normal check interval option controls the number of minutes before the service is checked again when its last status was OK. The retry check interval option specifies the number of minutes before the service is rechecked when its last status was non-OK. The notification options w,u,c,r stand for warning, unknown, critical and recovered, respectively. More definitions along the same lines can be added to the file. Use the commands stored in the plugin directory for the check command option to monitor other services.
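The check_command value check_ping!100.0,20%!500.0,60% in the PING service packs a command name and two arguments separated by exclamation marks: a warning threshold and a critical threshold, each made up of a round-trip time in milliseconds and a packet-loss percentage. The Python sketch below shows one way to read that string and apply the thresholds. The function names are hypothetical, and the comparison rule (a value at or above a threshold trips it) is an assumption about how the plugin interprets its arguments. The last helper restates the check-interval rule described above in its simplest form.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def split_check_command(check_command):
    """Split 'name!arg1!arg2' into the command name and its argument list."""
    name, *args = check_command.split("!")
    return name, args


def parse_threshold(arg):
    """'100.0,20%' -> (100.0, 20.0): round-trip ms, packet loss percent."""
    rta, loss = arg.split(",")
    return float(rta), float(loss.rstrip("%"))


def classify_ping(rta_ms, loss_pct, warn, crit):
    """Map a measured round-trip time and packet loss to a state."""
    if rta_ms >= crit[0] or loss_pct >= crit[1]:
        return CRITICAL
    if rta_ms >= warn[0] or loss_pct >= warn[1]:
        return WARNING
    return OK


def next_check_delay(state, normal_check_interval, retry_check_interval):
    """Minutes until the next check: normal interval after OK, retry otherwise."""
    return normal_check_interval if state == OK else retry_check_interval


name, args = split_check_command("check_ping!100.0,20%!500.0,60%")
warn, crit = parse_threshold(args[0]), parse_threshold(args[1])
state = classify_ping(150.0, 0.0, warn, crit)   # slow replies, no loss
print(name, state, next_check_delay(state, 5, 1))
```

Since the PING service above uses notification_options c,r, a WARNING result like the one in the example would show up in the web interface but would not trigger a notification.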

A contact definition is used to identify someone who should be contacted in the event of a problem on your network.

define contact{
contact_name nagiosad
alias Admin
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
service_notification_commands notify-by-email,notify-by-epager
host_notification_commands host-notify-by-email,host-notify-by-epager
email admin@cmil.com
pager 98xxxxxx
}








The e-mail option defines the e-mail address where notifications for that contact are to be sent. The pager option can be used to send SMS alerts (read about it in the following story).

A contact group definition is used to group one or more contacts together for the purpose of sending out alert/recovery notifications. When a host or service has a problem or recovers, Nagios will find the appropriate contact groups to send notifications to, and notify all contacts in those contact groups.

To define contact groups: 

define contactgroup{
contactgroup_name nt-admins
alias Administrators
members nagiosad
}

define contactgroup{
contactgroup_name linux-admins
alias Linux Admin
members nagiosad
}
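The chain from a host to the people who are notified runs through two levels of grouping: the host belongs to a hostgroup, the hostgroup names its contact groups, and each contact group lists its members. The following Python sketch walks that chain using the names from the definitions in this article. The dictionaries simply mirror those definitions by hand, and contacts_for_host is a hypothetical helper, not part of Nagios.

```python
# Hand-written mirror of the hostgroup, contactgroup and contact definitions.
HOSTGROUPS = {
    "windows-servers": {"contact_groups": ["nt-admins"],
                        "members": ["windows2k3"]},
    "linux-servers": {"contact_groups": ["linux-admins"],
                      "members": ["pcqlinux8"]},
}
CONTACTGROUPS = {
    "nt-admins": ["nagiosad"],
    "linux-admins": ["nagiosad"],
}
CONTACTS = {
    "nagiosad": {"email": "admin@cmil.com", "pager": "98xxxxxx"},
}


def contacts_for_host(host_name):
    """Return the sorted names of contacts to notify about a host problem."""
    names = set()
    for group in HOSTGROUPS.values():
        if host_name in group["members"]:
            for contact_group in group["contact_groups"]:
                names.update(CONTACTGROUPS.get(contact_group, []))
    return sorted(names)


for name in contacts_for_host("windows2k3"):
    print(name, CONTACTS[name]["email"])
```

In this setup both hostgroups resolve to the same single contact, nagiosad; adding a second contact to only one of the contact groups would split the notifications by platform.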




Unless there is a compelling need you do not need to make changes to other files, as the default settings will work fine. But if you do want to change them, then the files themselves contain detailed information about the various options.

RUNNING NAGIOS



Now after configuring your files, you are all set to start the Nagios process, but before doing that open the file dependencies.cfg and comment out all lines in it. Dependencies are an advanced feature of Nagios that allow you to suppress notifications for hosts based on the status of one or more other hosts. But you don't need that at this moment so comment out the lines in it or it will prevent Nagios from running.

To start the Nagios process issue the following command.

#service nagios start

To make Nagios start automatically at system boot, issue the following command.

#chkconfig nagios on

Now the Nagios process is up and running. When any host or service defined to be monitored by Nagios goes down or a service doesn't work properly you will be notified by e-mail and/or by SMS if you have configured Nagios to send SMS alerts as well. 

In case you want to know the status of your monitored hosts and services at any particular point in time then you can use the web interface of Nagios. The interface can also be accessed by a WAP phone.

To access Nagios' Web interface, open the URL http://<IP-address>/nagios/ in a browser, replacing <IP-address> with the IP address of the machine running Nagios. You will be asked for a username and password; provide the details and you will be logged on to the Nagios interface. From here you can view details about the hosts and services defined in the configuration files, generate reports to look for trends in host and service availability, check the configuration to see whether everything is defined properly, and so on.

IBM TIVOLI, HP OPENVIEW, CA UNICENTER: THE BIG DADDIES

For really large and complex networks you need products of a similar class and capability to manage them. The big daddies of this game are Tivoli from IBM, OpenView from HP and Unicenter from CA. These are not single products, but a large array of components that you choose from. They go beyond simple network management to integrate enterprise-wide IT and IT asset-management functions such as storage management, application management, help-desk management and service-level monitoring and management. Needless to say, these are complex beings and need separate infrastructure and staff to be dedicated to them. 

IBM Tivoli



IBM Tivoli is system-management software that monitors your network at the network-component and business system-application levels. It identifies critical problems as well as misleading symptoms and effects, and then either notifies support staff with the appropriate response or automatically cures the problem. It has modules that discover TCP/IP and SNA networks, display network topologies, correlate and manage events and SNMP traps, monitor network health and gather performance data. Tivoli can be used to monitor not just network resources, but also applications and storage resources. Other parts of Tivoli can distribute software throughout the enterprise, manage changes, let you control your IT assets, automate workflow through the enterprise and remotely control systems and applications. It also lets you maintain a device and application inventory for asset management in your organization.

HP OpenView



HP OpenView is network-management software that can manage the availability and performance of both voice and data networks. It discovers and maps the relationship of network devices to each other. 

It also measures the performance of your network, providing insights into network utilization and possible bottlenecks. The network node manager details the network topology for switched networks. The solution can be used for IP, ATM and telecom networks. It provides problem detection with appropriate alarms and detailed statistics about the problem. The problem is also mapped on the network topology diagram. OpenView also works with MPLS VPNs and WANs.






CA Unicenter



Computer Associates Unicenter is a suite of products that provides database management, enterprise job management, operation management, IT resource management and service management. The operation-management part manages the health and availability of the infrastructure. It monitors the performance of applications and the network, and provides optimization for mission-critical applications, Web services, databases and system management. It lets you proactively manage and optimize LAN, WAN, switched and VLAN networks. It has support for TCP/IP and SNA networks, OpenVMS and mainframe systems.

The IT resource-management part provides asset-tracking capabilities through automated discovery, hardware inventory, network inventory, software inventory, configuration management, software usage monitoring, license management and extensive cross-platform reporting.
