Downtime is perhaps the one word IT managers dread the most. It's to avoid
downtime that they invest heavily in management software, services, etc. In
today's complex IT infrastructure, there are hundreds of different servers
running, and many are interdependent. So, if even one goes down, it won't
bring down just one service, but several of them.
Imagine a scenario where your application server goes offline due to a minor
fault. This would affect other business applications that are dependent on it,
resulting not only in downtime, but also in business losses. Even though
bringing it back up could be a half-a-minute job, finding out about the
failure, locating the right server in the data center, and fixing it can take
much longer. If you have an online business portal, then even this much time
can translate into huge financial losses. So what do you do?
That's where a good management solution comes in, one that not only reports
services that are down, but also takes proactive action to correct the problem.
We'll take you through one such product, HP OpenView Operations (HPOVO)
Manager 7.5. It allows you to centrally and proactively manage a distributed,
heterogeneous IT infrastructure.
Key functions of HPOVO
The OpenView Operations Manager provides centralized operations management by
detecting problem events, automatically acting on some of them, and sending
the rest to the operator console for handling. You can set predefined policy
criteria on critical events so that automatic actions are taken. For example,
if for some reason the IIS server stops, an event is generated and passed to
OpenView Operations, where it's translated into an event code and the
predefined policy rules are executed to restart the IIS service. Events are key
data that tell us about small or big problems in the infrastructure. OpenView
Operations gives you a framework to send this data to a centralized repository,
so that immediate action can be taken remotely or manually before the problem
affects business processes. With this, you can manage operating systems,
applications, middleware (e.g., databases), and services, and operators can
work collaboratively when troubleshooting problems.
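The event-to-action idea described above can be sketched in a few lines. This is only an illustration in Python, not HPOVO's own policy format: the Event and PolicyRule classes, the handle function and the node name "web01" are all hypothetical, and the action merely returns a description instead of touching a real service.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Event:
    node: str    # managed server that raised the event (hypothetical field)
    source: str  # what the event is about, e.g. the "IIS" service
    state: str   # reported state, e.g. "stopped"


@dataclass(frozen=True)
class PolicyRule:
    source: str
    state: str
    action: Callable[[Event], str]  # automatic action to run on a match


def restart_service(event: Event) -> str:
    # A real action would restart the service on the node. This sketch only
    # returns a description, so it has no side effects.
    return f"restart {event.source} on {event.node}"


def handle(event: Event, rules: list[PolicyRule]) -> Optional[str]:
    """Run the first matching automatic action; return None when no rule
    matches, meaning the event is left for the operator console."""
    for rule in rules:
        if rule.source == event.source and rule.state == event.state:
            return rule.action(event)
    return None


# One predefined rule: a stopped IIS service triggers a restart.
RULES = [PolicyRule(source="IIS", state="stopped", action=restart_service)]
```

Calling handle(Event("web01", "IIS", "stopped"), RULES) picks the restart action, while an event with no matching rule returns None and would be shown to an operator instead.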
Here, you can see detailed information about the error. You are also given a remedy to troubleshoot it
How it works
Before implementing HP OpenView Operations Manager in a live environment, it's
necessary to understand how it works. This helps the IT manager deploy the
system seamlessly, and gives an idea of which services and applications can
be managed effectively using it. The software is a distributed, intelligent
automation management system, which also provides fault management and
workflow. The key steps this system uses are as follows:
1. Collecting data from event logs, general system messages and SNMP
traps: Intelligent agents are installed on all servers. These detect any failure
or performance degradation of any source on the managed system. They monitor
the system and application log files, general system messages, SNMP traps and
variables, hardware components (such as disks and CPUs) and custom variables
from any application. All events are collected, even if the network connection
to the central management station is down.
2. Processing the collected data: The data of collected events is converted
into a standard internal format, regardless of the original source. Irrelevant
and duplicate events are then filtered out, and the remaining events are stored
in a central repository. Events can trigger pre-defined automatic actions,
including sending messages to the operator console.
Processing also includes adding important or critical status information and
grouping events into categories such as 'security' or 'OS.' Using the
built-in notification service, events can be automatically forwarded to other
applications, for example, to send SMS alerts.
You can run commands or scripts to resolve the problem from the same window. This saves time and reduces downtime
3. Presenting event data to the operator console: The event data is
presented to the operator console in a consistent format, using six different
color-coded severity states that clearly indicate the severity of the failure
or performance degradation.
The operator can drill down to information about the available actions and
the annotations attached to a message. The console also gives the operator
event-specific instructions that describe the resolution process, so that the
problem can be resolved quickly.
4. Action taken: HP OpenView Operations Manager provides flexible mechanisms
to trigger a response to every critical event. As said earlier, pre-defined
actions can be fired automatically when an event occurs. To facilitate
troubleshooting, operators can initiate pre-defined tool actions with a single
mouse click, either to fix a problem or to gather additional data, such as
the services running on the managed system.
All information resulting from the action execution is stored in a central
database, which helps automate the resolution of problems over time. Operators
can also own and acknowledge events or escalate them to other operators and
applications. The event record also remains in the system.
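To make the processing step more concrete, here is a minimal Python sketch of normalizing raw events into one internal format and dropping irrelevant and duplicate ones. It is not HPOVO code: the NormalizedEvent fields, the dictionary keys and the rule that an empty message counts as irrelevant are all assumptions made for illustration. The six severity names are also assumed, since all we know here is that there are six levels.

```python
from dataclasses import dataclass

# Assumed severity names; the article only states that there are six levels.
SEVERITIES = ("unknown", "normal", "warning", "minor", "major", "critical")


@dataclass(frozen=True)
class NormalizedEvent:
    node: str
    category: str  # e.g. "security" or "OS"
    severity: str
    text: str


def normalize(raw: dict) -> NormalizedEvent:
    """Convert a raw record (log entry, SNMP trap, ...) into one format."""
    severity = str(raw.get("severity", "unknown")).lower()
    if severity not in SEVERITIES:
        severity = "unknown"
    return NormalizedEvent(
        node=str(raw.get("node", "")).lower(),
        category=str(raw.get("category", "OS")),
        severity=severity,
        text=str(raw.get("text", "")).strip(),
    )


def process(raw_events: list[dict]) -> list[NormalizedEvent]:
    """Normalize events, then drop irrelevant ones and exact duplicates."""
    seen: set[NormalizedEvent] = set()
    kept: list[NormalizedEvent] = []
    for raw in raw_events:
        event = normalize(raw)
        if not event.text:  # assumption: an empty message is irrelevant
            continue
        if event in seen:   # duplicate of an event that is already kept
            continue
        seen.add(event)
        kept.append(event)
    return kept
```

Because normalization runs first, two reports of the same fault that differ only in letter case or trailing spaces collapse into a single stored event.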
Key benefits
One key benefit of this product is that it supports the major OS platforms:
Windows, UNIX and Linux. Being a centralized point of control for the
network, servers, operating systems, applications and services, the software
makes it easy for a system manager to collect and manage all IT infrastructure
components of a business service.
It has out-of-the-box, policy-based management intelligence, which can be
extended by using application-specific OpenView Operations for Windows Smart
Plug-Ins. The system scales to heterogeneous environments of all sizes, from
10 to more than 1,000 servers. It can also manage both Microsoft .NET
and J2EE applications.
As soon as the error or event is generated on the managed node, it gets transferred to the HPOVO console
Implementation and use
The system is pretty straightforward to deploy. All you need is a Pentium 4
machine with at least 512 MB RAM, 10 GB of hard disk space and Windows
2000/2003 Server running a DNS server. The product has two components: server
and client agent.
First install HP OpenView Operations on the above-mentioned setup, and then
install the agent from the standalone agent CD on each client machine that
you want to manage. Both installations are straightforward using the tool's
wizard. As the system uses a SQL Server database for storing information, you
can either use its standalone database server or point it to a remote SQL
Server instance running on your setup. In that case, you have to provide the
server name and SA password to the wizard so that it can create the database
structure. Once the setup is over, reboot the system. You are now ready to use
it.
On the server, launch the HPOVO console, and you will get an MMC divided into
two parts. The left panel contains five components (services, nodes, tools,
reports and graphs, and policy management). The right panel shows the
details of the component selected in the left panel. Start by adding the managed
nodes (those that have an agent running) to this system, in order to monitor
them. To do so, first select 'Node' in the left panel and right-click on it.
After seeing the color of the alert on the left side, the operator can see fault details by double-clicking on the managed node
From the context menu, select Configure>node, which will open another
window that's divided into two parts. The left part shows all the available
networks, while the right part shows all the nodes managed by HP OpenView
Operations. As all our machines ran Windows, we selected 'Microsoft Windows
Networks' in the left panel and dragged the machines we wanted to monitor
to the right panel. Once the machines are added, HP OpenView Operations
will pick up all events raised by the managed nodes and show them on the
console.
Here, we will show how you can troubleshoot a problem remotely using HP
OpenView Operations. For this, let's take a very simple example of a DNS
service, which is running on one of the managed servers. We deliberately killed
the DNS.EXE process. This raised an alert on the HPOVO console.
Double-clicking on the warning alert will show you the error messages. In the
same window, clicking the Instructions tab will show you a remedy to
troubleshoot the problem. To execute the remedy from the same window, just
click on the Commands section, where you will see the command that will restart
the DNS service.
Just click on the Start button to execute the remedy on the target machine,
remotely. As this is a known case, a pre-defined solution is already available.
But in a real scenario, with different applications running, many of the errors
generated would be known only to the systems manager, who would also know how
to work around them.
In such a case, you can write the course of action for that error into the
policy rules, using any one or a combination of VBScript, Perl script, WMI
script or a batch file. This reduces the overall troubleshooting time and
downtime, and it makes the whole system more transparent. The IT manager can
track changes at each level in the IT infrastructure. This was a small example
to show how it works. You can create far more complex scripts.
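The logic of such a remedy is simple: check whether the service is running, and start it if it is not. HPOVO policies take VBScript, Perl, WMI scripts or batch files, so the Python below is only a rough sketch of that logic. It assumes a Windows node, where the standard 'sc query' and 'net start' commands are available, and a DNS service named "DNS". The function names and the injectable runner are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command line and returns its standard output.
Runner = Callable[[Sequence[str]], str]


def run_command(args: Sequence[str]) -> str:
    """Default runner: execute a command and return its standard output."""
    result = subprocess.run(
        list(args), capture_output=True, text=True, check=False
    )
    return result.stdout


def ensure_running(service: str, run: Runner = run_command) -> str:
    """Start the service unless 'sc query' already reports it as RUNNING."""
    status = run(["sc", "query", service])
    if "RUNNING" in status:
        return "already running"
    run(["net", "start", service])
    return "restart issued"
```

On the managed server you would call ensure_running("DNS"). Passing a different runner lets you try out the logic without touching a real service.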