Clustering means linking several servers together, through
special hardware and software, so that they appear as a single system to the
clients accessing them. The servers share the load amongst themselves, so if one
goes down, the remaining servers take up its share. This is useful for
applications that demand near-100 percent uptime, such as Web sites.
Clustering helps keep your services from going offline: even
if a service running on one server fails, it is resumed on another server.
Administering a cluster is also easier than administering multiple separate
servers, since you manage a single entity.
Moreover, as the load increases on a cluster, you can scale
it up by adding more processors or computers.
Clustering configurations
Cluster hardware configurations vary depending on the
technology and the operating system used, and come in three flavors:
Shared Disk: This approach uses central I/O devices
accessible to all computers within the cluster, which rely on a common bus for
disk access. Because all nodes can write to the disks simultaneously, data
integrity is difficult to maintain, so clustering software is required to keep
the data coherent.
Shared disk clusters provide high system availability: even
if one node goes down, the others are not affected. On the downside, the shared
hardware is an inherent bottleneck that can affect performance. The shared disk
approach is typically used by products such as Oracle's database clustering and
by AIX.
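The coherence problem that clustering software solves on shared disks can be pictured with a toy sketch. Here a plain threading.Lock stands in for a distributed lock manager, and all names are hypothetical; real cluster software coordinates locks across machines, not threads.

```python
import threading

class SharedDisk:
    """Toy stand-in for a disk array that all nodes in the cluster can reach."""
    def __init__(self):
        self.blocks = {}
        # Stands in for a distributed lock manager; without it, two nodes
        # could interleave writes to the same block -- the integrity problem
        # the shared disk approach must solve.
        self.lock = threading.Lock()

    def write(self, node, block, data):
        with self.lock:
            self.blocks[block] = (node, data)

disk = SharedDisk()
threads = [threading.Thread(target=disk.write, args=(f"node{i}", "b1", i))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(disk.blocks["b1"])  # last writer wins (which node is nondeterministic)
```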
Shared Nothing: In these clusters there is no central data
storage. All nodes work independently with their own disks, but each can take
over another node's disks if the node handling them ceases to function. The
nodes typically use a shared SCSI connection. This is not to be confused with
the shared disk approach, since here multiple nodes never access the same disks
concurrently. Shared nothing cluster solutions include MSCS (Microsoft Cluster
Server) for Windows NT/2000.
Mirrored Disk: Mirroring replicates all the data from a
primary storage device to a secondary one for availability purposes.
Replication occurs while the primary system is online. If a failure occurs, the
fail-over process (explained later) transfers control to the secondary system,
though some applications can lose a small amount of data during the switch. The
advantage of mirroring is that a disk failure neither brings down your network
nor, in normal operation, loses data. It may not be economical, however,
because of the redundant disks.
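The mirroring idea can be sketched in a few lines. This is only an illustration under assumed names (MirroredDisk, fail_over are invented here), showing synchronous replication while the primary is online and the secondary taking over after a failure.

```python
class MirroredDisk:
    """Toy sketch of disk mirroring: every write is replicated from the
    primary to the secondary store while the primary is online."""
    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.primary_up = True

    def write(self, block, data):
        if self.primary_up:
            self.primary[block] = data
            self.secondary[block] = data  # replicate while primary is online
        else:
            self.secondary[block] = data  # after fail-over, secondary serves I/O

    def read(self, block):
        store = self.primary if self.primary_up else self.secondary
        return store[block]

    def fail_over(self):
        self.primary_up = False  # control transfers to the secondary system

d = MirroredDisk()
d.write("b1", "payroll")
d.fail_over()
print(d.read("b1"))  # "payroll" -- data is still available after the failure
```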
Terminology and concepts
Members of a cluster are referred to as nodes. The Cluster
Service is a collection of software on each node that manages all
cluster-specific activity. A Resource is an item managed by the Cluster Service.
Resources may include physical hardware devices such as disk drives and network
cards, or logical items such as logical disk volumes, TCP/IP addresses, entire
applications, and databases. A resource is said to be online when it’s
providing its service on a node. A group is a collection of resources to be
managed as a single unit. Operations performed on a group affect all resources
contained in it.
A group can be owned by only one node at a time; you can't
have resources within a group owned by multiple nodes simultaneously. If a
particular node fails, its groups can be failed over, or moved, to another node
as atomic units. Each group has a cluster-wide policy specifying which node it
prefers to run on, and which node it should move to in case of failure.
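The resource/group model above can be sketched as follows. The class and resource names are hypothetical, not the Cluster Service API; the point is that a group has at most one owning node and that an operation on a group affects every resource in it.

```python
class Resource:
    """An item managed by the Cluster Service: a disk, an IP address,
    an application, and so on."""
    def __init__(self, name):
        self.name = name
        self.online = False

class Group:
    """A collection of resources managed as a single unit."""
    def __init__(self, name, resources):
        self.name = name
        self.resources = resources
        self.owner = None  # at most one node owns a group at any time

    def bring_online(self, node):
        self.owner = node
        for r in self.resources:
            r.online = True  # the operation affects all resources in the group

sql = Group("SQL", [Resource("disk D:"), Resource("10.0.0.5"), Resource("sqlservr")])
sql.bring_online("NODE1")
print(sql.owner, all(r.online for r in sql.resources))  # NODE1 True
```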
If a node fails, a fail-over process starts automatically and
redistributes its workload to the other nodes in the cluster; the
implementation differs between operating systems. When the node recovers from
the failure, a fail-back process ensures that it gets its load back.
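Fail-over and fail-back can be sketched as two small functions. The data layout and names here are invented for illustration: each group records its owner and a preference list standing in for the cluster-wide policy described above.

```python
def fail_over(groups, failed_node, survivors):
    """Move each group owned by the failed node, as an atomic unit, to the
    first surviving node on the group's preference list."""
    for g in groups:
        if g["owner"] == failed_node:
            g["owner"] = next(n for n in g["policy"] if n in survivors)

def fail_back(groups, recovered_node):
    """When a node recovers, return to it the groups that prefer it."""
    for g in groups:
        if g["policy"][0] == recovered_node:
            g["owner"] = recovered_node

groups = [{"name": "web", "owner": "A", "policy": ["A", "B"]},
          {"name": "db",  "owner": "B", "policy": ["B", "A"]}]
fail_over(groups, "A", survivors=["B"])
print([g["owner"] for g in groups])  # ['B', 'B'] -- B carries A's load
fail_back(groups, "A")
print([g["owner"] for g in groups])  # ['A', 'B'] -- A gets its load back
```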
Clustering in Windows 2000
In the Windows 2000 family, clustering is supported by the
Advanced Server and Datacenter Server editions. There are two flavors of
clustering: MSCS (Microsoft Cluster Service) and NLB (Network Load Balancing).
The first provides fail-over support for applications such as databases,
messaging systems, and file/print services, while the second distributes load
amongst nodes. MSCS can handle two-node clusters in Advanced Server and
four-node clusters in Datacenter Server; NLB can go up to 32 nodes in either.
MSCS uses software "heartbeats" to detect failed
applications or servers. When a node fails, its "shared nothing"
clustering architecture automatically transfers ownership of resources from the
failed node to a surviving node. If an individual application fails (and not a
node), MSCS will typically try to restart it on the same node; if that also
fails, it moves the application's resources to the other node.
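That restart-then-move policy can be sketched as a short loop. This is not MSCS code; the function names and the restart threshold are assumptions made for illustration.

```python
def recover(app, restart_limit=3):
    """Sketch of the recovery policy described above: when the application
    (not the node) fails, try restarting it locally a few times before
    moving its resources to another node. restart_limit is an assumed,
    hypothetical threshold."""
    for attempt in range(restart_limit):
        if app.restart():              # try again on the same node
            return "restarted locally"
    return "moved to surviving node"   # local restarts exhausted: fail over

class FlakyApp:
    """Test double: fails the first `fails` restart attempts, then succeeds."""
    def __init__(self, fails):
        self.fails = fails
    def restart(self):
        if self.fails > 0:
            self.fails -= 1
            return False
        return True

print(recover(FlakyApp(fails=1)))  # restarted locally
print(recover(FlakyApp(fails=5)))  # moved to surviving node
```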
NLB, as the name suggests, balances incoming traffic across
clusters of up to 32 nodes. One advantage of this setup is that you can add
servers as demand grows.
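The idea of spreading incoming traffic across the farm can be illustrated with a simple hash scheme. Real NLB uses its own filtering algorithm on each node; this toy sketch, with invented names, only shows how a client can be mapped deterministically to one node out of many.

```python
import hashlib

def pick_node(client_ip, nodes):
    """Hash the client address to choose one node, so the same client
    consistently lands on the same server (illustration only)."""
    h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = [f"web{i:02d}" for i in range(1, 9)]  # an 8-node farm; NLB allows up to 32
for ip in ("10.0.0.7", "10.0.0.8", "10.0.0.7"):
    print(ip, "->", pick_node(ip, nodes))
# the same client address always maps to the same node
```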
Both clustering technologies can be used in conjunction for
higher availability. Take a large Internet site, for example: you could have a
Web server farm with Network Load Balancing as the front end, while the back
end, say the database application, is handled by the Cluster Service.
Clustering under NetWare
Novell introduced NCS (NetWare Cluster Services) for NetWare
5 last fall. The service lets you create clusters of up to 32 nodes using the
shared disk architecture. It requires NetWare Support Pack 4 or higher, and you
can't mix NetWare 5.x versions in a cluster. All nodes must be configured with
TCP/IP and be on the same subnet. Each server needs at least 64 MB of RAM and
must be part of the same NDS tree. In addition, each server must have at least
one local (not shared) disk device to use as volume SYS, and the NDS tree must
be replicated on at least two servers in the cluster. The latest release of the
Cluster Services adds fail-over for DHCP servers, which was not present
earlier.
Anuj Jain