DR and BCP: Strategies for the Unforeseen

PCQ Bureau

01 Oct 2007 10:13 IST

A few months ago, I went to meet the CIO of a large enterprise with

thousands of branch offices across the country. Unfortunately, it happened to be

a particularly bad day for the company. I reached there in the afternoon of a

hot Delhi summer, only to find the entire office in darkness.

Apparently there had been no electricity in the area since 6 AM that morning, and
the UPSes were draining rapidly. They were sharing the premises with other

companies, and to my surprise, they also shared the same generator. The shared

generator wasn't powerful enough to supply power to all the companies

simultaneously, so the generator owner was busy cycling power through different

offices. The entire office building was in darkness, and the IT department was

busy switching off all lights and even fans. The reason: they needed power to

keep their data center up and running, since it was running mission critical

business applications, which were being accessed by the company's remote

branches. There was complete mayhem in the building, all because they had not

planned for this, and hadn't felt the need to purchase their own generator.

Believe it or not, this is how disasters strike.

So next time when you're creating or re-evaluating a disaster recovery plan

for your organization, don't just factor in natural calamities like rains,

floods, and hurricanes; or even man made disasters like bomb blasts, terrorist

attacks, etc. You also need to think of other possibilities that are more

closely associated with your office, which could affect your business. For

instance, if the company I described above had foreseen this situation early

enough, then they would have made arrangements for their own generator and

things wouldn't have been this bad. A sound disaster recovery and business

continuity strategy, therefore, needs to take into account such adverse

circumstances and much more. The first step, of course, is to realize that

it's important to have a DR and BCP plan. You'll notice that we're using DR

and BCP together, because the former is just one part of the latter. What

matters is not only recovering from a disaster, but also how quickly your

organization can be back in business.

State of deployment in the Data Center

Importance of a DR and BCP plan

Nobody today can deny the importance of having a sound DR and BCP strategy.

There have been enough calamities to ensure this. In fact, the CIO of an

insurance company based out of Mumbai once told me that July is always a

particularly bad month for Mumbai. Most catastrophes in Mumbai have happened

during this month over the years: bomb blasts, floods, and the CIO even recalled

a fire in his own office building at one time. Interestingly, I met him in

June, and he was dreading what might happen in the coming month. I wasn't

able to follow up later on whether anything really happened (I hope not), but

it's enough to make one realize the importance of being ready with a plan to

recover from a disaster. In fact, a few months ago, we did a survey on data

center management, wherein we asked key CIOs across the country what they've

already deployed and what they're likely to deploy in the near future. DR and

BCP ranked second on their list of priority technologies, which

shows its importance compared to other technologies and solutions.

Incidentally, virtualization was on top of the list, and this technology is also

being used today for disaster recovery. We'll talk about that a little later.

Indian companies reaching out to markets abroad have a very strong reason to

have well documented DR and BCP practices. Most of the customers abroad have it

as a part of their criteria for selecting a business partner in India.

Identify all 'What if' situations

The first step towards a sound disaster recovery and business continuity

strategy is to assess everything that can possibly go wrong, and prepare an action plan

for each case. This is by no means a small task, which is understandable because

CIOs are already so busy handling existing problems that planning for future

problems which 'might' occur is not easy to digest. But then, think of it like

taking out insurance cover for yourself. We first think of all the possible tragedies

that could happen and how they could affect our life in the future. Accordingly,

we prepare ourselves by going for the right insurance plan. A disaster recovery

and business continuity plan requires similar thinking for your business. Think

of everything that could possibly go wrong and severely hamper your business.

As most businesses today are powered by IT, you can't afford to have your

mission critical applications go down beyond permissible limits. You need to

identify and prepare a list of such applications and the duration for which

their downtime is affordable. Business applications usually top the chart in

this exercise. While this is definitely true, one must also realize that there

are other fairly important applications as well. Your business could come to a

standstill without them, and believe it or not, email is one such candidate.

Just bring down your mail server for a while and you'll realize its importance.

Email has become the communication backbone of every organization today, and

nowadays organizations are building messaging and workflow applications around

email. Email is a critical element in a unified messaging setup as well.
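
The list of applications and their affordable downtime described above can be kept as a simple data structure. The following is only a minimal sketch: the application names and hour figures are hypothetical, and `App` and `recovery_order` are illustrative names, not part of any product.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    max_downtime_hours: float  # longest outage the business can afford


def recovery_order(apps):
    """Return apps sorted so the least downtime-tolerant come first."""
    return sorted(apps, key=lambda a: a.max_downtime_hours)


# Hypothetical figures, for illustration only.
apps = [
    App("file server", 24),
    App("email", 2),
    App("core business application", 0.5),
]

for app in recovery_order(apps):
    print(f"{app.name}: restore within {app.max_downtime_hours} h")
```

Sorting by tolerable downtime gives a first draft of the order in which applications should be brought back after an outage.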

Besides applications, you also need to factor in many other elements, such as

the physical IT infrastructure, power backup, etc. Once you have a list of

elements that are critical for your business, you need to prioritize the risks

to those functions. For instance, is your office in a flood or earthquake prone

area? In one case, we found that a company had its data center in a high rise

building, and that too on the top floor. The company was really worried after

the 9/11 attacks on the World Trade Center, and wanted to move its data center

to another location immediately. In case of power backup, you need to see how

well your setup will work during a prolonged power outage. This fits directly

into the example I gave in the beginning.
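
One common way to prioritize such risks is to score each one by likelihood multiplied by impact and rank the results. The sketch below assumes a 1 to 5 scale for both values; the scale, the scores, and the function name `rank_risks` are assumptions made for illustration, not figures from any company mentioned here.

```python
def rank_risks(risks):
    """Sort (name, likelihood, impact) tuples by likelihood * impact, highest first."""
    return sorted(risks, key=lambda r: r[1] * r[2], reverse=True)


# Likelihood and impact on a 1-5 scale; all values are illustrative assumptions.
risks = [
    ("flood", 2, 5),
    ("prolonged power outage", 4, 4),
    ("earthquake", 1, 5),
]

for name, likelihood, impact in rank_risks(risks):
    print(f"{name}: score {likelihood * impact}")
```

With these made-up numbers the prolonged power outage ranks first, which is the kind of everyday risk that is easy to overlook next to more dramatic calamities.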

Involve all stakeholders

A DR and BCP plan can't be created in isolation, because the whole

organization depends on it. Therefore, the CIO needs to find out from all

business unit heads what's important for them in an emergency. Depending upon

that, you need to set up disaster recovery teams for each individual unit. You

then need to determine what the team will do, how it will communicate, what

information it will require, etc. These teams would also be responsible for

generating awareness amongst their groups about disaster recovery and its

importance.

Disaster Recovery@Mahindra & Mahindra Financial Services Ltd

Suresh Shanmugam, National Head, Information Systems & Technology, Mahindra and Mahindra Finance

Q. How does an organization like yours, which has a vast, widespread IT infrastructure, ensure business continuity?

DR is not only about data for us. It means ensuring infrastructure and

resource availability in remote locations. Plus, it means ensuring our

central site is available to the remote sites. While we have put in measures

to ensure resource availability at central site, we face a lot of problems

in remote areas, primarily due to non availability of power, and other

resources. We train our staff in the remote sites on how to effectively use

information and the IT infrastructure during crisis situations. Being a

customer centric corporation, Mahindra is concerned about the green

revolution. We're therefore considering using environment friendly energy

resources like solar panels or wind power in rural areas.

Q. How frequently do you test your DR setup for effectiveness?

As a process, every day the backed-up data is restored without fail and

made available for MIS users. Ever since we've started using hand held

devices on the field, there's an increased demand for our central site to be

available. The transaction information that comes from these handheld

devices is also moved to web servers for cross checking. Plus, our hardware

controls are proactively monitored / maintained.

Q. What are the key things to keep in mind when implementing a DR strategy?

In our case, it's the following:

1. Business continuity: our goal is to ensure that our rural customers get a loan in two days

2. Online corporate customer care service

3. Availability of our front end applications and communication information

4. Keeping online connectivity available at all locations for hand held devices to connect

Formulate an action plan

Once you've identified your mission critical elements and the stakeholders,

you need to prepare an action plan. This should include everything that will be

done for mitigating the risks that have been identified. For instance, most

organizations realize the importance of having a remote DR site, but what sort

of a site should it be? Should it be an exact replica of the parent site, and

always remain in active-active mode? If so, then the site will always remain

updated with the latest data, and will seamlessly take over if the primary data

center goes down. Such a proposition is the most expensive, because it requires

the exact replicas of your mission critical applications and other parts of the

IT infrastructure. In order to ensure that the data is backed up, WAN

connectivity is required, with redundant paths. Here again, the choice of a WAN

communication partner becomes important. You need a partner that can provide you

the bandwidth required as well as redundant links. The other type of remote DR

site is called active-passive. Here, the data is not updated live, but

periodically.
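
The practical difference between the two kinds of site is how much recent data can be lost when the primary goes down. The sketch below expresses that as an upper bound; it is a simplification that assumes an active-active site mirrors every update live and ignores in-flight transactions and transfer time. The function name and the 240-minute interval are hypothetical.

```python
def worst_case_data_loss_minutes(mode, sync_interval_minutes=0):
    """Upper bound on the window of lost updates if the primary site fails."""
    if mode == "active-active":
        return 0  # assumes every update is mirrored live
    if mode == "active-passive":
        # everything written since the last periodic update may be lost
        return sync_interval_minutes
    raise ValueError(f"unknown mode: {mode}")


print(worst_case_data_loss_minutes("active-active"))
print(worst_case_data_loss_minutes("active-passive", sync_interval_minutes=240))
```

Shortening the update interval of an active-passive site narrows the gap, at the price of more WAN traffic.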

Use technology to reduce DR costs

A traditional DR site is an exact replica of the primary site, making it

very expensive to host and maintain. The cost of servers and storage alone

becomes prohibitive. It's only natural, therefore, to be uncomfortable setting up

a DR site that will remain idle most of the time, kicking in only when disaster

strikes (which may or may not happen). Most CIOs therefore find it very difficult

to convince their top management to invest in a DR site. Technologies have evolved

over time to address this concern, and CIOs need to take them into account when

developing their DR strategies. Let's look at a few of them.

Disaster Recovery@HDFC Standard Life Insurance

Sunil Rawlani, CIO, HDFC Standard Life

Q. Mumbai has been particularly affected by many disasters in recent years: flooding due to heavy rains, chaos due to bomb blasts, etc. Given such a situation, what plans have you put in place to ensure business continuity?

We have an IT Outage Recovery plan. The plan starts with a simple power

outage and goes up to an extended outage. Our BCP objective is to ensure

business interruptions are as short as possible and are within a reasonable

range of financial consideration. We have varying recovery plans for defined

levels of outages. We own a Disaster Recovery site outside Mumbai and also

facilities for business processing.

Q. How long will it take for your business to be back up and running if a disaster strikes?

We have a Recovery Point Objective and a Recovery Time Objective. Both

are proportional to the extent of disaster, ranging from 24 to 96 hrs. The

DR plan starts with a simple power outage and goes up to an extended outage.

Q. How frequently do you test your DR setup for effectiveness?

We plan semi-annual DR drills.

Server Virtualization

This is being seen as a very cost effective option for DR sites.

Instead of having a one-to-one ratio of servers between the primary and DR site,

server virtualization allows you to have a one-to-many ratio, with one physical

server at the DR site standing in for many at the primary site. Let's understand

this a little more.

Server virtualization allows you to run multiple server applications on the

same physical server. Each server application runs on its own desired Operating

System, and remains isolated from others. Virtualization software like VMware or

Microsoft Virtual Server makes this possible. It creates a virtual layer on a

physical server, and lets you set up all your applications on this layer as

different virtual machines. Each virtual machine comprises one server app

with its OS.
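
To get a rough feel for how many physical servers a virtualized DR site might need, the primary servers can be packed onto DR hosts by load. The sketch below uses a first-fit-decreasing heuristic. The load figures (expressed as a percentage of one DR host) and the function name `pack_onto_hosts` are hypothetical, and real sizing would also have to consider memory, I/O, and licensing.

```python
def pack_onto_hosts(vm_loads, host_capacity):
    """First-fit-decreasing: place each VM on the first host that has room."""
    hosts = []  # each host is a list of the VM loads placed on it
    for load in sorted(vm_loads, reverse=True):
        if load > host_capacity:
            raise ValueError("a VM needs more capacity than one host offers")
        for host in hosts:
            if sum(host) + load <= host_capacity:
                host.append(load)
                break
        else:
            hosts.append([load])  # no existing host had room, so add one
    return hosts


# Hypothetical loads of six primary servers, as a percentage of one DR host.
loads = [30, 20, 50, 10, 40, 25]
hosts = pack_onto_hosts(loads, host_capacity=100)
print(len(loads), "primary servers fit on", len(hosts), "DR hosts")
```

First-fit-decreasing is a heuristic and does not always find the smallest possible number of hosts, but it is usually close enough for a first estimate.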

Another advantage of using server virtualization is that it makes testing

before deployment very easy. Once you've created a virtual machine of your main

server application, you can create its clone on the same server and test the two

for replication and synchronization. Since they're both on the same physical

machine, the testing is faster. Moreover, you can quickly start and stop one

virtual machine, which is the equivalent of booting up or shutting down a server. This will

allow you to simulate server shutdowns very conveniently. Once you're satisfied,

you can roll it out at the DR site.

Data Replication

This is not a new technology, but it has seen many developments. As the name

suggests, data replication is the ability to instantly replicate data generated

by your applications to remote sites. This is a big improvement over the

previous techniques wherein data would first be backed up to tape and then sent

to the remote DR site. There are data replication solutions available for a

variety of applications, including email servers, databases, and even file

servers. Snapshots are the most common example of data replication. Many storage

devices today offer this capability. They take the data from a particular

volume, and create its snapshot on another volume. This data could also be

replicated remotely to the DR site, either synchronously or asynchronously.

Another type of replication is host based replication, wherein the replication

software resides directly on the application server, and captures all I/O

activity from it to create a copy of the data. Replication software can also be

appliance based or storage based, wherein the software resides on the storage

device or controller.
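
The difference between synchronous and asynchronous replication can be shown with a toy model. This is only a sketch: `ReplicatedVolume` is a made-up class that keeps data in Python lists, and it does not represent how any particular storage product or replication software works.

```python
class ReplicatedVolume:
    """Toy model of a primary volume with a copy at a remote site."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.remote = []
        self.pending = []  # writes not yet shipped (asynchronous mode only)

    def write(self, block):
        self.primary.append(block)
        if self.synchronous:
            self.remote.append(block)   # write completes only once the remote has it
        else:
            self.pending.append(block)  # write completes now, shipped later

    def ship_pending(self):
        """Send queued writes to the remote site, as a background job would."""
        self.remote.extend(self.pending)
        self.pending.clear()

    def unreplicated_writes(self):
        return len(self.primary) - len(self.remote)


sync_copy = ReplicatedVolume(synchronous=True)
async_copy = ReplicatedVolume(synchronous=False)
for block in ["order-1", "order-2"]:
    sync_copy.write(block)
    async_copy.write(block)

print("unreplicated (sync):", sync_copy.unreplicated_writes())
print("unreplicated (async):", async_copy.unreplicated_writes())
async_copy.ship_pending()
print("unreplicated (async, after shipping):", async_copy.unreplicated_writes())
```

In the synchronous case the remote copy never lags, but every write waits on the WAN; in the asynchronous case writes finish quickly, but whatever is still queued would be lost if the primary failed at that moment.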

The choice of replication technology depends upon the number of applications and

the volume of data that need to be replicated. Host based replication would be

recommended if you don't have too many applications, else you'll be sitting with

separate replication software on each machine. This would make it

extremely difficult to manage. Likewise, appliance based replication would sit

on the storage appliance, but might be limited to replicating data across the

same type of appliance on the other side. This is one of the things that every

CIO must check before going for a SAN: does the SAN allow homogeneous or

heterogeneous data replication? If it's the former, then you would require the

same SAN setup at the DR site, but the latter would allow you to replicate data

to any other storage device. This gives you the flexibility of choosing lower

cost storage devices at the DR site.

WAN Accelerators

With so much data being replicated across a WAN, you need something that can

accelerate the process. Otherwise, you'll be spending huge amounts on the

bandwidth. That's where a WAN accelerator comes in. It helps improve data

replication performance over the WAN, thereby reducing your bandwidth costs.
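
A back-of-the-envelope calculation shows why this matters. The sketch below estimates how long a replication job takes over a WAN link, with and without traffic reduction (for example through compression or deduplication). The 50 GB volume, 10 Mbps link, and 4x reduction ratio are hypothetical; actual gains depend heavily on the data and the product.

```python
def transfer_hours(data_gb, link_mbps, reduction_ratio=1.0):
    """Hours needed to send data_gb over a link of link_mbps.

    reduction_ratio is how many times smaller the accelerator makes the
    traffic (1.0 means no reduction). Protocol overhead is ignored.
    """
    megabits = data_gb * 8 * 1000 / reduction_ratio  # decimal units: 1 GB = 8000 Mb
    return megabits / link_mbps / 3600


plain = transfer_hours(50, 10)
accelerated = transfer_hours(50, 10, reduction_ratio=4)
print(f"without acceleration: {plain:.1f} h")
print(f"with 4x reduction:    {accelerated:.1f} h")
```

The same arithmetic can be turned around to estimate how much bandwidth must be bought to finish replication within a given window.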

There are many other examples of using technology to reduce your DR costs.

We've covered some of the latest ones. Lastly, for a DR and BCP strategy to be

successful, it needs to be ingrained in the company culture. That will only

happen if you create enough awareness about it, and get buy-in from all the

stakeholders.

Disaster Recovery@Infosys

Pritam Kumar Sinha, Sr. Project Manager-IT, Infosys Technologies

Infosys recently did a massive consolidation exercise of its data center, wherein

servers, backup, storage, and space were all consolidated. The exercise

helped Infosys save a lot of cost. The backup consolidation exercise helped

put in place an improved and cost efficient disaster recovery strategy,

which reduced recurring costs by up to 95%. The company initially had too

many tape drives for data backup, with no central monitoring, which led to

high recurring costs. The tapes were also based on outdated technology, so

if a disaster struck, there would have been no way to restore data from

them. The company shifted to autoloaders with centralized backup management

and monitoring. This reduced the backup window and even eased the load on the LAN.

Plus, the company also implemented a SAN,

and used snapshot technology for backups on it. So in case of a

level 1 disaster, which is basically data loss due to file corruption or

human error, data can be restored from the snapshot in just 10 minutes.

Infosys tests its DR setup for effectiveness once every six months.
