
Performance and Fault Management

PCQ Bureau

03 Aug 2004 11:22 IST


Managing a network and providing optimal service to users without delay is a challenge. Bottlenecks can affect both network and application performance, and the trouble is that they can start anywhere, from a rogue application driving up your server's CPU utilization to a misconfigured or outdated server configuration. On the network, the cause could be inherent latency in the equipment, or jitter and queuing due to excessive traffic flow.


Performance management deals with monitoring the performance levels of all your mission-critical network devices, raising alerts when performance deteriorates and, finally, addressing the performance issues.

The Technologies

The primary and most widely used technologies for performance management are based on the SNMP and RMON standards. An SNMP agent runs on every device being monitored and lets the device exchange information with an SNMP-based performance-management system. The agent accumulates real-time data that the management system can retrieve at periodic intervals. This data is stored by the agent, in summarized form, in a standard format defined by the MIB for the respective device. When an event occurs, such as a failure or a threshold being exceeded, the agent sends an SNMP trap to the management system so that action can be taken.

The information provided by the SNMP agent is limited; RMON probes, on the other hand, provide detailed information. The RMON standard enables agents to communicate with the management system using SNMP and organizes monitoring functions into nine groups. However, RMON1 doesn't provide protocol information, application-layer data or client-server communication across the network. RMON2 overcomes this by monitoring IP communication between clients and servers, showing who's talking to whom. RMON2 also supports proactive monitoring through an automatic alarm facility for network violations. Some vendors, such as Cisco and Lucent, also use proprietary technologies.
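The threshold-driven alerting described above can be sketched in a few lines of Python. This is only an illustration, not an SNMP or RMON implementation: the function name `threshold_events` and the sample values are hypothetical, and the rising/falling pair is a simplified take on the hysteresis idea, so that a value hovering near the limit does not raise an alert on every poll.

```python
def threshold_events(samples, rising, falling):
    """Return (index, kind) events for a series of polled values.

    A "rising" event fires when a sample reaches the rising threshold.
    No further rising event fires until a sample has dropped to the
    falling threshold, and vice versa (simple hysteresis).
    """
    events = []
    armed_rising = True    # a rising event may fire next
    armed_falling = False  # a falling event may fire only after a rising one
    for index, value in enumerate(samples):
        if armed_rising and value >= rising:
            events.append((index, "rising"))
            armed_rising, armed_falling = False, True
        elif armed_falling and value <= falling:
            events.append((index, "falling"))
            armed_rising, armed_falling = True, False
    return events


# Hypothetical CPU utilization samples (percent), one per polling interval.
cpu_samples = [10, 50, 85, 90, 70, 40, 88]
events = threshold_events(cpu_samples, rising=80, falling=45)
```

In a real deployment each event would correspond to a trap sent to the management system; here the events are simply collected in a list.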


Software is available from many vendors to manage, monitor and report on network performance using varying technologies. Cisco has software called SAA (Service Assurance Agent) that sits on its routers and switches. It measures performance by sending synthetic packets to a target IP device, which echoes them back to the sender (similar to ping). SAA also provides proactive notification by sending SNMP traps when a threshold is crossed, which lets you compare actual performance against the desired level. SAA can also be configured to run automatically when a threshold is exceeded, and the information can be retrieved using Cisco's IOS command-line interface. Some of the performance-management applications that use SAA are Firehunter, eHealth, VistaView and IPInsight.
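The synthetic-probe idea can be illustrated without any Cisco software. The sketch below is not SAA; it assumes a hypothetical `probe` callable that performs one round trip (for example, a TCP connect to the target) and raises `OSError` on failure. The helper names `run_probes` and `summarize` are likewise made up for this example.

```python
import time


def run_probes(probe, count=5):
    """Call probe() count times and return round-trip times in milliseconds.

    A probe that raises OSError is recorded as None (treated as a lost probe).
    """
    results = []
    for _ in range(count):
        start = time.perf_counter()
        try:
            probe()
        except OSError:
            results.append(None)
            continue
        results.append((time.perf_counter() - start) * 1000.0)
    return results


def summarize(results, threshold_ms):
    """Average the successful probes and flag a threshold violation."""
    ok = [r for r in results if r is not None]
    lost = len(results) - len(ok)
    avg = sum(ok) / len(ok) if ok else None
    return {
        "avg_ms": avg,
        "loss_pct": 100.0 * lost / len(results) if results else 0.0,
        # Alert if nothing answered or the average is above the threshold.
        "alert": avg is None or avg > threshold_ms,
    }
```

For example, `summarize(run_probes(my_probe), threshold_ms=50)` would report the average response time, the share of lost probes and whether an alert is due, where `my_probe` is whatever round-trip operation you choose to time.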

SAA measures response time between IP devices, and SNMP traps can be configured to generate alerts if the response time exceeds predefined thresholds. It can also measure HTTP performance, VoIP jitter and packet loss. Another technology, Cisco NetFlow, gives detailed statistics of traffic flow for capacity planning and troubleshooting. The statistics you get include IP type of service, TCP/UDP source and destination ports, and input and output interface numbers.
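To show how per-flow statistics of this kind feed capacity planning, here is a small sketch that totals traffic by one field of a flow record. The record layout is an assumption made for the example (plain dictionaries with hypothetical keys such as `dst_port` and `bytes`); it is not the NetFlow export format.

```python
from collections import Counter


def bytes_by_key(flows, key):
    """Total the byte counts of flow records grouped by one field.

    Returns (value, total_bytes) pairs, largest total first.
    """
    totals = Counter()
    for flow in flows:
        totals[flow[key]] += flow["bytes"]
    return totals.most_common()


# Hypothetical flow records; field names are illustrative only.
flows = [
    {"tos": 0, "src_port": 40001, "dst_port": 80, "input_if": 1, "output_if": 2, "bytes": 1200},
    {"tos": 0, "src_port": 40002, "dst_port": 25, "input_if": 1, "output_if": 3, "bytes": 300},
    {"tos": 0, "src_port": 40003, "dst_port": 80, "input_if": 2, "output_if": 2, "bytes": 800},
]
top_ports = bytes_by_key(flows, "dst_port")
```

Grouping by `output_if` instead of `dst_port` would show which interface carries the most traffic.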

The best way to do performance management is to first analyze the base performance of your network. On a normal day, determine the utilization level of various devices. This will form the network baseline data. For example, if you're monitoring a switch, then the baseline data could be related to per port throughput and buffer management. Once you know this, it becomes easier to recognize varying performance patterns. Doing a network what-if analysis will then determine the effect of changes to your network, which could, for instance, be an increase in network traffic due to the addition of a new application.
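A baseline of this kind can be reduced to a couple of numbers. The following is a minimal sketch, assuming you have collected utilization samples (in percent) on a normal day; the names `build_baseline` and `deviates`, and the choice of three standard deviations as the cut-off, are assumptions for illustration rather than a recommended policy.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize normal-day samples as a mean and a standard deviation.

    Needs at least two samples, because stdev() is undefined for fewer.
    """
    return {"mean": mean(samples), "stdev": stdev(samples)}


def deviates(baseline, value, k=3.0):
    """True if value lies more than k standard deviations from the mean."""
    return abs(value - baseline["mean"]) > k * baseline["stdev"]


# Hypothetical per-port utilization samples (percent) from a normal day.
baseline = build_baseline([40, 42, 38, 41, 39])
```

With this baseline, a later reading of 55 percent would be flagged by `deviates(baseline, 55)`, while a reading of 41 percent would not.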

You then need to identify solutions that resolve capacity and performance issues arising from various conditions. The aim is to receive notification of capacity and performance threshold violations so that they can be rectified. Quality of service is equally important to the user, and there should be minimum delay in the service.


Key performance parameters

Network performance is measured in a number of ways. One of them is response time, which is the time it takes for a network packet to make a round trip between two points. If you've defined an ideal response time for your network based on the baseline data, then a measured value well above it points to network congestion, while a far larger value or no response at all points to a network fault.

Another parameter is accuracy, which is a measure of traffic that flows without error, expressed as a percentage of the total traffic. For instance, if three out of every 100 packets result in error, the error rate would be 3 percent and the accuracy would be 97 percent. Some reasons for errors are structured cabling that does not adhere to specifications, electrical interference, and faulty hardware or software.
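The arithmetic in that example is simple enough to write down directly. The function names below are hypothetical.

```python
def error_rate_percent(total_packets, error_packets):
    """Share of packets that resulted in error, in percent."""
    return 100.0 * error_packets / total_packets


def accuracy_percent(total_packets, error_packets):
    """Share of packets that flowed without error, in percent."""
    return 100.0 * (total_packets - error_packets) / total_packets


# The example from the text: three errors in every 100 packets.
rate = error_rate_percent(100, 3)
accuracy = accuracy_percent(100, 3)
```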

Alarm generation whenever performance goes below and above the set threshold


Next is utilization, which measures the use of a resource over time and is expressed as a percentage of the resource's maximum operational capacity. Utilization helps you locate potential sources of network congestion, since a resource can be either underutilized or overutilized. Overutilization occurs when more traffic is queued to pass than the resource can handle, while low utilization can indicate that traffic is flowing in an unexpected way. A sudden jump in resource utilization indicates a fault condition.
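For a network link, utilization is commonly worked out from two readings of an interface octet counter taken a known interval apart. The sketch below assumes a 32-bit counter that wraps around to zero (as the standard `ifInOctets` and `ifOutOctets` counters do) and a known link speed in bits per second; the function names and sample numbers are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit counter wraps back to 0 after this many octets


def octet_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Octets transferred between two readings, allowing for one counter wrap."""
    return (current - previous) % modulus


def utilization_percent(previous, current, interval_s, speed_bps):
    """Link utilization over the polling interval, as a percentage of capacity."""
    bits = octet_delta(previous, current) * 8
    return 100.0 * bits / (interval_s * speed_bps)


# Hypothetical readings: 75,000,000 octets in 60 seconds on a 100 Mbps link.
link_utilization = utilization_percent(1_000_000, 76_000_000, 60, 100_000_000)
```

The modulo in `octet_delta` handles a single wrap between polls. On fast links the polling interval has to be short enough that the counter cannot wrap more than once.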

Another parameter factored into performance management is network availability, which represents how reliably network components respond to user needs. It is a measure of the time for which the network is available to users, and it can be worked out from the statistics reported by the devices.
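Expressed as a percentage, availability is the available time divided by the total time in the measurement period. A minimal sketch, with a hypothetical function name and made-up figures:

```python
def availability_percent(total_minutes, downtime_minutes):
    """Percentage of the measurement period during which the network was available."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# Hypothetical month: 30 days (43,200 minutes) with 43.2 minutes of downtime.
monthly_availability = availability_percent(43_200, 43.2)
```

In this made-up case the result works out to 99.9 percent.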

Latency and jitter are two other parameters used for measuring network performance. Latency is the time it takes for data to flow from one point to another on the network under zero-load conditions. Jitter is the variation in that delay from one packet to the next, which typically shows up when there's a lot of traffic on the network; some packets may then reach the destination late or out of order, causing the data to be scrambled.
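Taking jitter in its usual sense of variation in packet delay, both effects can be quantified from a list of per-packet delays and arrival sequence numbers. The sketch below uses one simple measure, the mean absolute difference between consecutive delays; other definitions exist, and the function names and sample values are hypothetical.

```python
def mean_jitter(delays_ms):
    """Mean absolute difference between consecutive packet delays, in ms."""
    if len(delays_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return sum(diffs) / len(diffs)


def count_out_of_order(sequence_numbers):
    """Count packets that arrive after a packet with a higher sequence number."""
    highest = None
    late = 0
    for seq in sequence_numbers:
        if highest is not None and seq < highest:
            late += 1
        else:
            highest = seq
    return late


# Hypothetical one-way delays (ms) and arrival order of packets 1 to 5.
jitter = mean_jitter([20, 22, 19, 25])
late_packets = count_out_of_order([1, 2, 4, 3, 5])
```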


Fault management

Fault management is the process of detecting, isolating and correcting faults on a network. You need solutions to ensure that faults can be managed without bringing down the network or causing any degradation of service to users. Fault management is done at device level and network level.

As with performance problems, a fault can occur anywhere on the network, be it the hardware, the cabling or your WAN links. What's important in fault management is to identify the key points and devices on your network, and monitor those for faults. These could be your Internet gateway, your router or even the switches on various segments of your network.

You need to ensure that you're ready to tackle the problem as soon as it occurs so there's minimum downtime. For instance, you might want to go for a hardware network probe or a link tester to detect broken links. Fluke Networks, for one, has several hardware probes that can detect faults and traffic patterns on a network. Then, of course, you need to configure SNMP and RMON so that you get automated alerts.

Cisco has CiscoWorks Fault History, a browser-based management utility that provides detailed information about the alerts and faults detected by CiscoWorks DFM (Device Fault Manager). DFM can be used as a fault subsystem even for devices from non-Cisco vendors. It monitors devices according to their type under different conditions through ICMP (Internet Control Message Protocol) polling and SNMP.
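The polling approach to fault detection comes down to comparing each device's latest status with its previous one. The sketch below is a generic illustration, not how DFM works internally: it assumes the poller has already produced a reachable/unreachable flag per device, and the function name and device names are hypothetical.

```python
def detect_transitions(previous, current):
    """Compare two polling rounds and report status changes.

    Both arguments map a device name to True (reachable) or False.
    A device not seen in the previous round is assumed to have been up.
    """
    events = []
    for device in sorted(current):
        was_up = previous.get(device, True)
        is_up = current[device]
        if was_up and not is_up:
            events.append((device, "fault"))
        elif not was_up and is_up:
            events.append((device, "cleared"))
    return events


# Hypothetical results of two consecutive polling rounds.
last_round = {"gateway": True, "router": False, "switch-1": True}
this_round = {"gateway": False, "router": True, "switch-1": True}
fault_events = detect_transitions(last_round, this_round)
```

Reporting only the transitions, rather than every failed poll, keeps a device that stays down from flooding the operator with repeated alerts.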
