Over the last few months, we have looked at many types of clusters: SSI-based
ones such as OpenMosix and MPI-based ones such as OSCAR and Flash Mob. If you
search the Net, you will find quite a few different cluster products out there.
Some have a graphical front end to monitor the nodes and some don't have one at
all. Let's take my favorite example, OpenMosix. It does have a graphical
monitoring application called OpenMosixView, but have you noticed how difficult
monitoring becomes once the number of nodes grows to around a hundred? Plus, it
only shows you the current RAM and CPU utilization of the nodes. What about
disk usage? And what if you want to see what the CPU utilization was over the
last hour or day?
These things are very difficult to monitor on large Grids or Clusters. To make
things even worse, let's say you have multiple clusters: one SSI-based and the
other an OSCAR or Rocks cluster with MPI support, and you want to monitor both
of them from one place. What will you do then?
That is why we realized this time that it's not enough just to describe
different clustering techniques; it is also important to include some
applications with which you can actually go ahead and manage those huge
clusters with ease. So this time we picked one of the most popular Cluster and
Grid monitoring tools, Ganglia. To give you an idea, this product is so popular
that just about every company that has a Grid or HPC setup would be using it in
some manner or other. The application is *nix-based, but to my surprise, they
have Microsoft on their users' list. I am not sure where exactly Microsoft uses
it, but the Ganglia website says they do. For more info, go to
http://ganglia.sourceforge.net/ and check out the 'Who uses Ganglia' section.
You will see all the big names like NASA, Cray, Sun, Boeing, US Air Force, etc.
What is Ganglia?
According to Ganglia's website, it is a scalable distributed monitoring
system for high-performance computing systems such as Clusters and Grids. It is
based on a hierarchical design targeted at federations of clusters. It uses
widely used technologies such as XML for data representation, XDR for compact
and portable data transport, and RRDtool for data storage and visualization. It
uses carefully engineered data structures and algorithms to achieve very low
per-node overheads and high concurrency. The implementation is robust, has been
ported to an extensive set of operating systems and processor architectures, and
is currently in use on thousands of clusters around the world. It has been used
to link clusters across university campuses and around the world and can scale
to handle clusters with 2000 + nodes.
In slightly simpler terms, Ganglia is an application with which you can monitor
any kind of Cluster or Grid that runs on a *nix platform. You can even monitor
different types of Clusters from one installation of Ganglia and from a single
front end.
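To get a feel for the XML side of this, here is a minimal shell sketch; it is
not taken from the Ganglia documentation. It assumes that gmond's XML arrives
with one element per line, that NAME is the first attribute of every HOST and
METRIC element, and that VAL is the second attribute of METRIC. The function
name host_loads and the sample node names are made up for illustration.

```shell
# host_loads: read gmond-style XML on stdin and print "<host> <load_one>"
# for every host. Hypothetical helper; it relies on one element per line
# and on the attribute order NAME="..." VAL="..." (both assumptions).
host_loads() {
    awk -F'"' '
        /<HOST /                       { host = $2 }
        /<METRIC / && $2 == "load_one" { print host, $4 }
    '
}

# Illustrative sample shaped like gmond output (not captured from a real node).
sample='<CLUSTER NAME="demo">
<HOST NAME="node1" IP="10.0.0.1">
<METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>
</HOST>
<HOST NAME="node2" IP="10.0.0.2">
<METRIC NAME="load_one" VAL="1.50" TYPE="float" UNITS=""/>
</HOST>
</CLUSTER>'

printf '%s\n' "$sample" | host_loads
```

On a live node with a default gmond configuration, something like
`nc localhost 8649 | host_loads` should give the same kind of listing, assuming
nc (netcat) is installed and gmond is listening on its default TCP port.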
Here is a live site where you can see Ganglia working on a cluster of over 800 nodes (Courtesy: rockscluster.org)
What can it do?
Broadly, the software can monitor average and per-node CPU, memory, swap and
disk usage across the cluster. So just by looking at the front page of Ganglia,
which is essentially a dashboard, you will be able to see how busy or free your
Cluster is and what its performance and utilization were over the last hour.
Not only that, you can also monitor the job queue for the Grid.
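Ganglia keeps this history in RRDtool files, and the same numbers can be pulled
out by hand. The sketch below is only an illustration: it assumes the text
layout printed by `rrdtool fetch` (a header line, then 'timestamp: value' rows,
with 'nan' for missing samples), and the function name rrd_average and the
sample values are invented.

```shell
# rrd_average: average the first data column of "rrdtool fetch" style text
# read from stdin, skipping the header line and any nan (unknown) samples.
rrd_average() {
    awk '
        $1 ~ /^[0-9]+:$/ && tolower($2) !~ /nan/ { sum += $2; n++ }
        END { if (n > 0) printf "%.2f\n", sum / n; else print "no data" }
    '
}

# Made-up sample in the assumed output layout.
printf '%s\n' \
    '                          sum' \
    '' \
    '1146000000: 1.0000000000e+00' \
    '1146000015: 3.0000000000e+00' \
    '1146000030: nan' | rrd_average
```

Against a real installation, the input would come from something like
`rrdtool fetch /var/lib/ganglia/rrds/<cluster>/<node>/cpu_user.rrd AVERAGE -s -1h`.
That path is the usual default location, so check where your own gmetad writes
its files.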
The representation is completely graphical and is really easy to understand and
use. It can do many other great things, which we will see later on. But if you
want to see it working before you go ahead and deploy it on your cluster, go to
http://ganglia.info/?page_id=47 and select one of the two demo clusters. For
instance, the second option is a Ganglia deployment over a Rocks Cluster with
around 900 nodes. Here you can actually play around and see how it works.
Installation nightmare
Now comes the main part: installing and configuring the software. You will soon
see why I call its installation a nightmare. But before you start, rest assured
that it is not going to bother you as much as it bothered me. I did not have
any documentation handy that could have pointed me to the real cause of the
problem while I was installing it, and I had to search through thousands of
forums and help pages before I figured out the solution.
When I first saw the application and thought of doing an article on it, I just
downloaded three RPMs from
http://sourceforge.net/project/showfiles.php?group_id=43021&package_id=35280
and installed them on top of a full installation of PCQLinux 2006 and restarted
my web server (Apache). Then I opened http://localhost/ganglia and, to my
surprise, the thing worked without any configuration and showed a single-CPU,
single-node cluster. I was so happy that I immediately isolated two nodes from
my earlier OpenMosix cluster, which runs a full installation of PCQLinux 2004
with OpenMosix support out of the box. I used PCQLinux 2004 for the OM cluster
because the stable release of OM is still the one for the 2.4 kernel. And I am
sure that most people out there will stick to 2.4 kernel based clusters because
it's more stable for this purpose.
Then I installed those RPMs on the two-node OM Cluster. To do so, I first
downloaded the three of them from the following locations:
http://prdownloads.sourceforge.net/ganglia/ganglia-gmetad-3.0.3-1.fc4.i386.rpm?download
http://prdownloads.sourceforge.net/ganglia/ganglia-gmond-3.0.3-1.fc4.i386.rpm?download
http://prdownloads.sourceforge.net/ganglia/ganglia-web-3.0.3-1.noarch.rpm?download
The first one is the Ganglia meta daemon package (gmetad), the second one is
the main monitoring daemon package (gmond) and the last one is the web
interface. After downloading, run the following in sequence to install them:
#rpm -ivh ganglia-gmetad-3.0.3-1.fc4.i386.rpm
#rpm -ivh ganglia-gmond-3.0.3-1.fc4.i386.rpm
#rpm -ivh ganglia-web-3.0.3-1.noarch.rpm
Now you have to check whether you have the latest version of RRDtool installed
on your machine. If not, download it from
http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/pub/rrdtool.tar.gz
and install it as follows:
#gunzip rrdtool.tar.gz
#tar -xvf rrdtool.tar
#cd rrdtool (the extracted directory name may carry a version suffix)
#./configure
#make
#rpm -e rrdtool (to remove any older rrdtool RPM, if present)
#make install
The Ganglia time bug
There is a bug in because the HPET timer
This will install RRDtool on your machine. I didn't run these commands on the
PCQLinux 2006 installation, because it already has the latest version of
RRDtool. The second thing I did was to install the ganglia-gmond RPM on the
second cluster node. This is because Ganglia is an agent-based monitoring tool,
and the gmond daemon must be installed on every node that you want to monitor
with Ganglia.
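With more than a couple of nodes, repeating this by hand gets tedious, so here
is a rough sketch of how the step could be scripted. Everything specific in it
is an assumption: the node names are placeholders, root access over ssh is
presumed, and the function name deploy_gmond is made up. By default it only
prints the commands it would run; set DRY_RUN=0 to actually execute them.

```shell
# Placeholder node names (assumption); replace with your own hosts.
NODES="node1 node2"
RPM="ganglia-gmond-3.0.3-1.fc4.i386.rpm"

# deploy_gmond NODE: copy the gmond RPM to NODE and install it there.
# With DRY_RUN unset or set to 1, the commands are only printed.
deploy_gmond() {
    node=$1
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "scp $RPM root@$node:/tmp/"
        echo "ssh root@$node rpm -ivh /tmp/$RPM"
    else
        scp "$RPM" "root@$node:/tmp/" &&
            ssh "root@$node" "rpm -ivh /tmp/$RPM"
    fi
}

for node in $NODES; do
    deploy_gmond "$node"
done
```

Keeping the dry run as the default makes it easy to review the exact commands
before anything touches the nodes.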
After doing all this, I thought I was ready with Ganglia. I restarted the web
server on the first node, where I had installed all the RPMs including RRDtool
and the Ganglia web package, fired up the browser and tried to connect to
http://localhost/ganglia. The site opened, but alas, there were no graphs on
it. From here on, my nightmare started.
The site and its links were fully working, except that there were just two
cross marks in place of the graphs. So I went hunting for the solution.
It took me around four hours to fix it. It was nothing but a problem with clock
synchronization inside the kernel. For more details on the bug, read the box
item 'The Ganglia time bug'. The solution was as simple as adding 'notsc' just
after 'LABEL=/' in the kernel parameters in the /boot/grub/grub.conf file.
After making this change and restarting my machine, I opened up the Ganglia web
interface and found everything in place and working very smoothly.
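For illustration, the grub.conf edit can be expressed as a small sed filter.
This is a sketch under assumptions: the function name add_notsc is made up, the
kernel file name in the example is a placeholder, and it only handles kernel
lines that contain exactly 'root=LABEL=/'. It reads a grub.conf on standard
input and writes the changed text to standard output, so nothing is modified in
place.

```shell
# add_notsc: append "notsc" right after "root=LABEL=/" on every kernel line
# that does not already carry it. All other lines pass through unchanged.
add_notsc() {
    sed '
/^[[:space:]]*kernel/ {
    /notsc/!{
        s|root=LABEL=/ |&notsc |
        s|root=LABEL=/$|& notsc|
    }
}'
}

# Placeholder kernel line (the file name is not from a real system).
printf '%s\n' 'kernel /vmlinuz-2.4.x ro root=LABEL=/' | add_notsc
```

A cautious way to use it would be
`add_notsc < /boot/grub/grub.conf > /tmp/grub.conf.new`, followed by a diff
against the original before copying anything back.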
I was impressed by the software to such an extent that I have decided to make a
customized distro for all my cluster OSs with Ganglia built into it. But for
that, you will have to wait for future issues of PCQuest. I am not sure when I
will do that, but if you feel such a distro is worthwhile, write to us, and we
may bring the schedule for that article forward and make it next month.
Using Ganglia: The sweet dream
1. This is the front page of the Ganglia web interface, which you see when you open http://localhost/ganglia. From here, you can monitor the average CPU and memory utilization of the Cluster or Grid, as well as the job queue.
2. Clicking on any of the graphs on the previous screen takes you to this page. It shows some more details about the cluster, including the network traffic and the load on the Cluster over the last hour.
3. When you click on the 'Choose Node' drop-down menu, you will find the IPs of all the nodes. Selecting one displays the details of that particular node: average load, memory and CPU utilization, etc.
4. Click on the Node View link at the top of the page. You will see a box with a detailed view of the node. It includes software details such as kernel version, swap space and uptime, and hardware details such as disk space and utilization.
Anindya Roy