A few months ago, I went to meet the CIO of a large enterprise with
thousands of branch offices across the country. Unfortunately, it happened to be
a particularly bad day for the company. I reached the office on a hot Delhi
summer afternoon, only to find the entire place in darkness.
Apparently there had been no electricity in the area since 6 AM, and the
UPSes were draining rapidly. The company shared the premises with other
companies and, to my surprise, shared the same generator as well. The shared
generator wasn't powerful enough to supply all the companies simultaneously, so
the generator owner was busy cycling power through different offices. The IT
department, meanwhile, was busy switching off all lights and even fans. The
reason: they needed power to keep their data center up and running, since it
hosted mission-critical business applications that the company's remote
branches were accessing. There was complete mayhem in the building, all because
they had not planned for this and hadn't felt the need to purchase their own
generator. Believe it or not, this is how disasters strike.
So the next time you're creating or re-evaluating a disaster recovery plan
for your organization, don't just factor in natural calamities like rains,
floods, and hurricanes, or man-made disasters like bomb blasts and terrorist
attacks. You also need to think of other possibilities, more closely associated
with your own office, that could affect your business. For instance, had the
company I described above foreseen this situation early enough, it would have
arranged for its own generator and things wouldn't have been this bad. A sound
disaster recovery and business continuity strategy, therefore, needs to take
into account such adverse circumstances and much more. The first step, of
course, is to realize that it's important to have a DR and BCP plan. You'll
notice that we're using DR and BCP together, because the former is just one part
of the latter. What matters is not only recovering from a disaster, but also how
quickly your organization can be back in business.
[Figure: State of deployment in the Data Center]
Importance of a DR and BCP plan
Nobody today can deny the importance of having a sound DR and BCP strategy.
There have been enough calamities to drive the point home. In fact, the CIO of
an insurance company based in Mumbai once told me that July is always a
particularly bad month for the city. Most catastrophes in Mumbai over the years
have happened during this month: bomb blasts, floods, and, the CIO recalled,
even a fire in his own office building. Interestingly, I had met him in June,
and he was dreading what the coming month would bring. I wasn't able to follow
up later on whether anything really happened (I hope not), but it's enough to
make one realize the importance of being ready with a plan to recover from a
disaster. In fact, a few months ago, we did a survey on data center management,
in which we asked key CIOs across the country what they had already deployed and
what they were likely to deploy in the near future. DR and BCP came second on
their priority list of technologies, which shows its importance relative to
other technologies and solutions. Incidentally, virtualization topped the list,
and this technology is also being used today for disaster recovery. We'll talk
about that a little later.
Indian companies reaching out to markets abroad have a very strong reason to
have well-documented DR and BCP practices. Most customers abroad make this part
of their criteria for selecting a business partner in India.
Identify all 'What if' situations
The first step towards a sound disaster recovery and business continuity
strategy is to assess everything that can possibly go wrong, and prepare an
action plan for each case. This is by no means a small task. CIOs are already so
busy handling existing problems that planning for future problems that 'might'
occur is not easy to digest. But then, think of it like buying insurance cover
for yourself. We first think of all the possible tragedies that could happen and
how they could affect our lives in the future. Accordingly, we prepare ourselves
by choosing the right insurance plan. A disaster recovery and business
continuity plan requires similar thinking for your business. Think of everything
that could possibly go wrong and severely hamper your business.
As most businesses today are powered by IT, you can't afford to have your
mission-critical applications go down beyond permissible limits. You need to
identify and list such applications, along with the duration of downtime that
the business can afford for each. Business applications usually top the chart in
this exercise. While that is as it should be, one must also realize that there
are other fairly important applications as well. Your business could come to a
standstill without them and, believe it or not, email is one such candidate.
Just bring down your mail server for a while and you'll realize its importance.
Email has become the communication backbone of every organization today, and
organizations are now building messaging and workflow applications around it.
Email is a critical element in a unified messaging setup as well.
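The list described above can be kept as a simple inventory. What follows is only a minimal sketch in Python: the application names and downtime figures are made up for illustration, and the `Application` class and `recovery_order` function are our own hypothetical names, not part of any product. The idea is simply that the application that can tolerate the least downtime should be restored first.

```python
from dataclasses import dataclass


@dataclass
class Application:
    name: str
    max_downtime_hours: float  # longest outage the business can tolerate


def recovery_order(apps):
    """Return applications sorted so the least downtime-tolerant come first."""
    return sorted(apps, key=lambda app: app.max_downtime_hours)


# Hypothetical inventory; names and figures are illustrative only.
inventory = [
    Application("File server", 24),
    Application("Email", 2),
    Application("Core business application", 1),
    Application("Intranet portal", 48),
]

for app in recovery_order(inventory):
    print(f"{app.name}: restore within {app.max_downtime_hours} h")
```

In practice the downtime figures would come from the business unit heads, not from IT alone.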
Besides applications, you also need to factor in many other elements, such as
the physical IT infrastructure and power backup. Once you have a list of
elements that are critical for your business, you need to prioritize the risks
to them. For instance, is your office in a flood- or earthquake-prone
area? In one case, we found that a company had its data center in a high-rise
building, and on the top floor at that. The company was really worried after
the 9/11 attacks on the World Trade Center, and wanted to move its data center
to another location immediately. In the case of power backup, you need to see
how well your setup will work during a prolonged power outage. This ties in
directly with the example I gave in the beginning.
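One common way to prioritize risks, which is our assumption here and not something prescribed above, is to score each risk as likelihood multiplied by impact and then rank by that score. The sketch below uses a 1 to 5 scale for both; the risks and their ratings are hypothetical, and `prioritize_risks` is just an illustrative name.

```python
def prioritize_risks(risks):
    """Rank risks by likelihood x impact, highest score first.

    Each risk is a (name, likelihood, impact) tuple, both rated 1 to 5.
    """
    scored = [(name, likelihood * impact) for name, likelihood, impact in risks]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical ratings for one office; adjust them to your own location.
risks = [
    ("Prolonged power outage", 4, 4),
    ("Flood", 2, 5),
    ("Earthquake", 1, 5),
    ("WAN link failure", 3, 3),
]

for name, score in prioritize_risks(risks):
    print(f"{score:>2}  {name}")
```

With these made-up ratings, a prolonged power outage ranks above a flood or an earthquake, which is the situation the company in the opening example failed to plan for.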
Involve all stakeholders
A DR and BCP plan can't be created in isolation, because the whole
organization depends on it. Therefore, the CIO needs to find out from all
business unit heads what's important for them in an emergency. Based on that,
you need to set up disaster recovery teams for each individual unit. You
then need to determine what each team will do, how it will communicate, what
information it will require, and so on. These teams would also be responsible
for generating awareness within their groups about disaster recovery and its
importance.
Formulate an action plan
Once you've identified your mission-critical elements and the stakeholders,
you need to prepare an action plan. This should include everything that will be
done to mitigate the risks that have been identified. For instance, most
organizations realize the importance of having a remote DR site, but what sort
of site should it be? Should it be an exact replica of the primary site, always
running in active-active mode? If so, the site will always be up to date with
the latest data, and will seamlessly take over if the primary data center goes
down. Such a setup is the most expensive, because it requires exact replicas of
your mission-critical applications and other parts of the IT infrastructure. To
keep the data at the DR site current, you need WAN connectivity with redundant
paths. Here again, the choice of WAN communication partner becomes important:
you need one that can provide both the required bandwidth and redundant links.
The other type of remote DR site is called active-passive. Here, the data is
not updated live, but periodically.
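The trade-off between the two types of site can be put in rough numbers. The sketch below rests on simplifying assumptions of our own: an active-active site is treated as losing no data, an active-passive site can lose up to one full update interval, and transfer time is ignored. The function names are hypothetical.

```python
def worst_case_data_loss_minutes(mode, update_interval_minutes=0.0):
    """Worst-case window of lost data if the primary site fails.

    Assumes active-active is updated live (no loss) and active-passive
    can lose everything written since the last periodic update.
    """
    if mode == "active-active":
        return 0.0
    if mode == "active-passive":
        return float(update_interval_minutes)
    raise ValueError(f"unknown DR site mode: {mode!r}")


def cheapest_acceptable_mode(tolerable_loss_minutes, update_interval_minutes):
    """Prefer the cheaper active-passive site whenever its loss window is acceptable."""
    passive_loss = worst_case_data_loss_minutes("active-passive", update_interval_minutes)
    if passive_loss <= tolerable_loss_minutes:
        return "active-passive"
    return "active-active"


# Hourly updates are fine if the business can tolerate four hours of lost data...
print(cheapest_acceptable_mode(240, 60))  # prints "active-passive"
# ...but not if it can tolerate only five minutes.
print(cheapest_acceptable_mode(5, 60))    # prints "active-active"
```

The point of the exercise is that the update interval of an active-passive site should be set from the data loss the business can tolerate, not the other way round.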
Use technology to reduce DR costs
A traditional DR site is an exact replica of the primary site, which makes
it very expensive to host and maintain. The cost of servers and storage alone
can be prohibitive. It's only natural, therefore, to be uncomfortable setting up
a DR site that will remain idle most of the time, kicking in only when disaster
strikes (which may or may not happen). Most CIOs therefore find it very
difficult to convince their top management to invest in a DR site. Technology
has evolved over time to address this concern, and CIOs need to take these
developments into account when developing their DR strategies. Let's look at a
few of them.
Server Virtualization
Server virtualization is being considered a very cost-effective option for
DR sites. Instead of a one-to-one ratio of servers between the primary and DR
sites, it allows a many-to-one ratio: several servers at the primary site can be
backed by a single physical server at the DR site. Let's understand this a
little more.
Server virtualization allows you to run multiple server applications on the
same physical server. Each server application runs on its own desired operating
system, and remains isolated from the others. Virtualization software like
VMware or Microsoft Virtual Server makes this possible. It creates a virtual
layer on a physical server, and lets you set up all your applications on this
layer as different virtual machines. Each virtual machine comprises one server
application with its OS.
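To get a feel for how much hardware consolidation can save at the DR site, here is a rough sizing sketch. It assumes, for simplicity, that memory is the only constraint and uses a first-fit decreasing heuristic to place virtual machines on hosts; real capacity planning would also look at CPU, storage, and I/O. The server sizes and the `pack_vms` function are hypothetical.

```python
def pack_vms(vm_memory_gb, host_memory_gb):
    """First-fit decreasing: place each VM on the first DR host with room.

    Returns a list of hosts, each a list of the VM sizes placed on it.
    """
    hosts = []
    for size in sorted(vm_memory_gb, reverse=True):
        if size > host_memory_gb:
            raise ValueError(f"a {size} GB VM cannot fit on a {host_memory_gb} GB host")
        for host in hosts:
            if sum(host) + size <= host_memory_gb:
                host.append(size)
                break
        else:
            # No existing host had room, so add another physical server.
            hosts.append([size])
    return hosts


# Seven hypothetical primary-site servers, sized by the memory each one needs.
primary_servers_gb = [8, 4, 4, 2, 2, 6, 6]
dr_hosts = pack_vms(primary_servers_gb, host_memory_gb=16)
print(f"{len(primary_servers_gb)} primary servers fit on {len(dr_hosts)} DR hosts")
```

With these made-up figures, seven primary servers fit on two 16 GB hosts at the DR site instead of seven separate machines.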
Another advantage of server virtualization is that it makes testing before
deployment very easy. Once you've created a virtual machine of your main server
application, you can clone it on the same server and test the two for
replication and synchronization. Since they're both on the same physical
machine, the testing is faster. Moreover, you can quickly start and stop a
virtual machine, which is the equivalent of booting up or shutting down a
server. This lets you simulate server shutdowns very conveniently. Once you're
satisfied, you can roll it out at the DR site.
Data Replication
This is not a new technology, but it has seen many developments. As the name
suggests, data replication is the ability to replicate data generated by your
applications to remote sites almost as soon as it is written. This is a breath
of fresh air compared to previous techniques, wherein data would first be backed
up to tape and then sent to the remote DR site. Data replication solutions are
available for a variety of applications, including email servers, databases, and
even file servers. Snapshots are a common example. Many storage devices today
offer this capability. They take the data from a particular volume, and create
its snapshot on another volume. This data could also be replicated remotely to
the DR site, either synchronously or asynchronously. Another type is host-based
replication, wherein the replication software resides directly on the
application server, and captures all I/O activity from it to create a copy of
the data. Replication software can also be appliance-based or storage-based,
wherein the software resides on the storage device or controller.
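The difference between synchronous and asynchronous replication is easiest to see in a toy model. The sketch below is not how any real replication product works internally; it is a deliberately simplified illustration in which lists stand in for volumes, and the `ReplicatedVolume` class is a name we made up. In synchronous mode a write reaches the replica before it completes, whereas in asynchronous mode writes queue up and are shipped later.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary volume with one remote replica."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []       # blocks written at the primary site
        self.replica = []       # blocks that have reached the DR site
        self.pending = deque()  # blocks waiting to be shipped (async only)

    def write(self, block):
        self.primary.append(block)
        if self.synchronous:
            # The write is complete only once the replica has it too.
            self.replica.append(block)
        else:
            # The write completes immediately; the replica catches up later.
            self.pending.append(block)

    def ship_pending(self):
        """Send all queued blocks to the replica, as a background job would."""
        while self.pending:
            self.replica.append(self.pending.popleft())

    def unreplicated_writes(self):
        """Writes that would be lost if the primary failed right now."""
        return len(self.primary) - len(self.replica)


sync_volume = ReplicatedVolume(synchronous=True)
async_volume = ReplicatedVolume(synchronous=False)
for block in ["order-1", "order-2", "order-3"]:
    sync_volume.write(block)
    async_volume.write(block)

print(sync_volume.unreplicated_writes())   # prints 0
print(async_volume.unreplicated_writes())  # prints 3
```

Synchronous replication never leaves writes at risk, but every write has to wait for the WAN round trip. Asynchronous replication keeps the application fast, at the cost of a window of writes that a disaster could wipe out.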
The choice of replication technology depends on the number of applications
and the volume of data to be replicated. Host-based replication is recommended
if you don't have too many applications; otherwise you'll end up with different
replication software on each machine, which is extremely difficult to manage.
Appliance-based replication, likewise, sits on the storage appliance, but might
be limited to replicating data to the same type of appliance on the other side.
This is one of the things that every CIO must check before going for a SAN: does
the SAN allow homogeneous or heterogeneous data replication? If it's the former,
you would require the same SAN setup at the DR site, whereas the latter would
allow you to replicate data to any other storage device. That gives you the
flexibility of choosing lower-cost storage devices at the DR site.
WAN Accelerators
With so much data being replicated across a WAN, you need something that can
accelerate the process. Otherwise, you'll be spending huge amounts on
bandwidth. That's where a WAN accelerator comes in. It improves data
replication performance over the WAN, thereby reducing your bandwidth costs.
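A quick back-of-the-envelope calculation shows why this matters. The sketch below assumes that an accelerator reduces the amount of data actually sent over the link by some factor (for instance through compression or de-duplication); the factor of 4 used here is purely illustrative and not a vendor figure. It also ignores protocol overhead and latency, and `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(data_gb, link_mbps, reduction_ratio=1.0):
    """Hours needed to send data_gb gigabytes over a link_mbps link.

    reduction_ratio is the assumed factor by which an accelerator shrinks
    the data on the wire (1.0 means no accelerator). Uses decimal units:
    1 GB = 8,000 megabits.
    """
    if link_mbps <= 0 or reduction_ratio <= 0:
        raise ValueError("link speed and reduction ratio must be positive")
    megabits_on_wire = data_gb * 8 * 1000 / reduction_ratio
    return megabits_on_wire / link_mbps / 3600


# Replicating 50 GB of daily changes over a 10 Mbps link.
print(f"Without accelerator: {transfer_hours(50, 10):.1f} h")
print(f"With 4x reduction:   {transfer_hours(50, 10, reduction_ratio=4):.1f} h")
```

Under these assumptions, the same daily replication drops from about 11 hours to under 3 hours, which can be the difference between finishing overnight and needing a bigger link.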
There are many other examples of using technology to reduce your DR costs;
we've covered some of the latest ones. Lastly, for a DR and BCP strategy to be
successful, it needs to be inculcated into the company culture. That will only
happen if you create enough awareness about it and get buy-in from all the
stakeholders.