June 7, 2002

Post 9/11, Nasdaq came back into business within six days. Now think of all that had to be re-established: the applications, the mountains of data, and more. That’s the power of an effective disaster recovery strategy.

That name’s a bit of a misnomer, though. Disaster doesn’t refer only to events on the scale of an earthquake or a terrorist attack; a virus attack on your network, or a hardware or software failure, is just as big a disaster for your data. And these are events that have a high probability of occurring. A Gartner-Dataquest study in 1999 suggested that natural disasters accounted for only 7 percent of unplanned downtime, while hardware and software failures and human error together accounted for 68 percent.

In an age of inter-connected networks and online transactions, any downtime can mean significant losses for your organization. In Nasdaq’s case, for example, losses were in tens of millions of dollars. Your organization may not be as big, but the fact is that you need to have a plan in place that shows how your company will come back online when a disaster, small or large, strikes.

This plan has to be an organization-wide, detailed document and has to include not only key pieces of your business, like online databases, but also LANs, PCs, and telecommunications. That’s what disaster recovery and business continuity planning–the two related buzzwords today–are all about. Together they describe a set of plans and processes that you follow to resume business operations in the event of any sort of unplanned outage.

Building a plan
Your disaster recovery and business-continuity plan has to deal with how essential and other systems can be restored in minimal time. These include telephone lines and your ever-essential Internet connection.

Analyze the impact
The first step to building a plan is to do a BIA (Business Impact Analysis). This has to be an enterprise-wide process with representatives from all divisions. Here, you list all the assets in each division and decide which of these are the most crucial to bring your business back online. Then, you list the potential incidents that could bring your business, or part of it, down, including scenarios like one of your external partners’ sites going down. You rate this list of possible disaster scenarios in terms of probability of occurrence and severity of impact on your business.
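The rating step can be sketched in a few lines of Python. This is only an illustration: the 1-to-5 scales, the multiplication of probability by severity, and the example scenarios are all assumptions made here, not part of any standard BIA method.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    probability: int  # assumed scale: 1 (rare) to 5 (frequent)
    severity: int     # assumed scale: 1 (minor) to 5 (business-stopping)

    @property
    def score(self) -> int:
        # Assumed scoring rule: probability multiplied by severity.
        return self.probability * self.severity


def rank_scenarios(scenarios):
    """Return the scenarios sorted from highest to lowest score."""
    return sorted(scenarios, key=lambda s: s.score, reverse=True)


# Hypothetical scenarios and ratings, for illustration only.
scenarios = [
    Scenario("Earthquake at primary site", probability=1, severity=5),
    Scenario("Database server disk failure", probability=4, severity=4),
    Scenario("Virus outbreak on the LAN", probability=3, severity=3),
    Scenario("Partner site outage", probability=2, severity=3),
]

ranked = rank_scenarios(scenarios)
for s in ranked:
    print(f"{s.score:>2}  {s.name}")
```

With these made-up ratings, the everyday disk failure outranks the earthquake, which matches the point made earlier about where most unplanned downtime comes from.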

List out actions
What will be the sequence of events in the event of a disaster? Who will take charge in what kind of disaster? For instance, if it’s only a server going down, your IT department may be able to take care of that. Which systems from which site will come up first and which later? How much time will this take? Who will be the people driving the process? At what stage would you take the help of specialists? Will you go in for hot sites or mobile units (explained later)? You answer all these questions in detail here–this is like scripting your disaster scenario, where you go into every scene and see everything that will need to be done.

Test the plan
Then comes the dress rehearsal–the key players have to test out what they would do if a disaster struck. The test itself should be planned, and its results audited and documented carefully. These tests should be carried out on all your backup sites regularly.

Maintain the plan
Your plan may change with changes in your business organization. Someone has to be in charge of keeping the plan updated with all these changes.

Safeguard the plan
Just as you do offsite backups and archival of your data, you have to archive your plan too. A common mistake is to leave your disaster recovery documentation in your office, so that in the event of a physical disaster, it’s among the first things to be lost! Similarly, you should have extensive contact details of the key people in the plan kept safely offsite, so you can contact them when disaster strikes. In fact, it’s best to have your entire employee database backed up and kept safely offsite, along with full data on your systems–configuration, vendor, applications running, data stored.

How to implement it 
Once you’ve determined the critical and other data of your business, you will need various technologies and services that’ll help you restore it in the event of a disaster. Here, there is a tradeoff between how much you can protect and how much it’ll cost you–the more you try to protect, the higher the cost. You have to weigh the cost of protecting everything against the cost of losing some of this data or restoring it later. Your plan also has to consider that the data will grow with time.
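One rough way to frame this tradeoff is to compare the yearly cost of a protection measure with the downtime loss it is expected to avoid. The sketch below assumes a very simple loss model (incidents per year times hours down times loss per hour), and every figure in it is hypothetical.

```python
def expected_annual_loss(incidents_per_year, hours_down_per_incident, loss_per_hour):
    """Assumed model: loss grows linearly with total hours of downtime."""
    return incidents_per_year * hours_down_per_incident * loss_per_hour


def worth_protecting(protection_cost_per_year, loss_without, loss_with):
    """True if the protection costs less than the loss it is expected to avoid."""
    return protection_cost_per_year < (loss_without - loss_with)


# Hypothetical figures: two incidents a year, 5,000 lost per hour of downtime.
loss_without = expected_annual_loss(2, 8, 5000)    # restore from tape: 8 hours down
loss_with = expected_annual_loss(2, 0.5, 5000)     # replicated site: half an hour down

print(loss_without, loss_with)
print(worth_protecting(40000, loss_without, loss_with))
```

A real analysis would also account for data growth and for losses that are hard to price, such as reputation, so treat the result as a starting point for discussion rather than a verdict.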

Here are a few of the commonly used techniques for keeping your data available in the event of a disaster. Various vendors, such as EMC, Veritas, and Hitachi Data Systems, provide solutions for these.

Good ol’ tape
Offline backup of data to tape has been the most popular method of keeping a copy to restore from. These tapes should be stored offsite, in what are called electronic vaults, to manage the huge volumes of data. You should also keep a copy of this data onsite, so that in case of minor system failures, you can restore it immediately. However, you must test your tapes regularly, and take care not to put them through too many cycles. This will help ensure that your data is safe and can be recovered when you want.

‘Hot’ data replication
For applications where you want 24×7 availability, ‘hot’ replication is used. You should ideally replicate this data to more than one remote site to minimize the chances of failure from events like a failed network connection. There are two methods of data replication–synchronous and asynchronous. The former writes data to the replication site before the write operation is considered complete on the host system. This ensures data accuracy, but can cause delays on the primary system. The latter doesn’t cause any delays on the host and gives optimal performance, but there is a small lag between when the data is written on the host system and when it’s replicated. A technique called journaling is used to ensure that data is written in the same order on the target server as it is on the host.
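The difference between the two modes can be shown with a toy in-memory model. This is a sketch only: the Primary and Replica classes and their methods are hypothetical names invented for illustration, and real replication products work over a network and handle failures that this code ignores.

```python
from collections import deque


class Replica:
    """Stands in for the remote site; it just records writes as they arrive."""

    def __init__(self):
        self.log = []

    def apply(self, seq, data):
        self.log.append((seq, data))


class Primary:
    def __init__(self, replica, synchronous):
        self.replica = replica
        self.synchronous = synchronous
        self.log = []
        self.journal = deque()  # ordered queue of writes not yet replicated
        self.next_seq = 0

    def write(self, data):
        seq = self.next_seq
        self.next_seq += 1
        if self.synchronous:
            # Synchronous: the remote copy is written before the local
            # write is treated as complete, so the caller waits for it.
            self.replica.apply(seq, data)
            self.log.append((seq, data))
        else:
            # Asynchronous: complete locally at once and journal the
            # write so it can be replayed remotely in the same order.
            self.log.append((seq, data))
            self.journal.append((seq, data))
        return seq

    def drain_journal(self):
        """Replay journaled writes on the replica, oldest first."""
        while self.journal:
            self.replica.apply(*self.journal.popleft())


sync_replica = Replica()
sync_primary = Primary(sync_replica, synchronous=True)
sync_primary.write("order-1")

async_replica = Replica()
async_primary = Primary(async_replica, synchronous=False)
for item in ("order-1", "order-2", "order-3"):
    async_primary.write(item)
lag = len(async_primary.log) - len(async_replica.log)  # writes not yet replicated
async_primary.drain_journal()
```

In the asynchronous case the replica trails the primary by three writes until the journal is drained, after which both logs hold the same entries in the same order.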

Apart from replicating your primary data to many secondary sites, another common way of ensuring data availability is to mirror to a secondary site. Here, the data at your secondary site has two other mirrored copies for further safety from events like disk failure on your secondary site. You can also have a third mirrored copy for a point-in-time representation of your data. This is useful when online data gets corrupted due to a virus attack and the same is written to the secondary site as well. You can then use this mirrored copy to restore data to a point before the infection.
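The point-in-time idea can be sketched as a small store of timestamped copies from which you pick the newest one taken before the corruption. The class and method names below are hypothetical, the timestamps are plain numbers, and copies are assumed to be taken in increasing time order.

```python
import bisect
import copy


class PointInTimeCopies:
    """Keeps timestamped copies of a data set (illustrative only)."""

    def __init__(self):
        self._times = []
        self._copies = []

    def take(self, timestamp, data):
        # Assumption: take() is always called with increasing timestamps.
        self._times.append(timestamp)
        self._copies.append(copy.deepcopy(data))

    def restore_before(self, timestamp):
        """Return the newest copy taken strictly before the given time."""
        i = bisect.bisect_left(self._times, timestamp)
        if i == 0:
            raise LookupError("no copy older than the requested time")
        return copy.deepcopy(self._copies[i - 1])


copies = PointInTimeCopies()
copies.take(100, {"balance": 10})
copies.take(200, {"balance": 25})
copies.take(300, {"balance": "corrupted"})  # taken after the infection

# Suppose the infection is known to have started at time 250.
restored = copies.restore_before(250)
print(restored)
```

The copy taken at time 300 already contains the damage, so the restore falls back to the one taken at time 200.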

You can use clustering with replication for better protection. Here, if one machine in a cluster fails, your applications will automatically fail over to the second one in the cluster. Of course, you have to configure the clustering software for when to fail over, which machine to fail over to, etc. Using both these solutions together will ensure that you’re protected from both local disk failures and system failures, and that you have a copy of your live data offsite in case something happens to your site.
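The two settings mentioned here, when to fail over and which machine to fail over to, can be illustrated with a minimal heartbeat-based sketch. The Cluster class, the node names, and the timeout value are assumptions made for illustration; actual clustering software has its own configuration and far more safeguards.

```python
class Cluster:
    def __init__(self, nodes, heartbeat_timeout):
        self.nodes = list(nodes)                    # listed in failover priority order
        self.heartbeat_timeout = heartbeat_timeout  # seconds of silence tolerated
        self.last_heartbeat = {n: 0.0 for n in self.nodes}
        self.active = self.nodes[0]

    def heartbeat(self, node, now):
        """Record that a node reported in at time `now`."""
        self.last_heartbeat[node] = now

    def is_alive(self, node, now):
        return now - self.last_heartbeat[node] <= self.heartbeat_timeout

    def check(self, now):
        """Keep the active node if it is healthy, else fail over by priority."""
        if self.is_alive(self.active, now):
            return self.active
        for node in self.nodes:
            if self.is_alive(node, now):
                self.active = node
                return node
        raise RuntimeError("no healthy node available")


cluster = Cluster(["node-a", "node-b"], heartbeat_timeout=5.0)
cluster.heartbeat("node-a", now=10.0)
cluster.heartbeat("node-b", now=10.0)
before = cluster.check(now=12.0)   # node-a is still reporting in

cluster.heartbeat("node-b", now=18.0)
after = cluster.check(now=18.0)    # node-a has been silent for 8 seconds
print(before, after)
```

Choosing the timeout is the design decision that matters: too short and a brief network hiccup triggers an unnecessary failover, too long and the application stays down while the cluster waits.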

Apart from these techniques for ensuring data availability for your mission-critical applications, you also have to plan for ways of restoring telecommunications and Internet access.

Getting work on track
The other aspect of a disaster-recovery strategy is to plan for an alternative workspace–an area your team can work from while the primary site is being restored. A popular way of doing this is to contract with a recovery site vendor. You can ask for a hot site, a mobile unit, or speedy shipment of equipment to your location.

These vendors provide you with various services. They could provide you with a place to house people or equipment. In the case of a hot site, they also provide you with computers and even workgroup spaces that are furnished and have PCs, peripherals, telecommunications, and call management systems. You can also contract for a hot site that you can quickly configure to match your system. Some companies in the US also transfer their critical data to these sites electronically.

Another option is to contract for a mobile unit, where the service provider delivers a working office on wheels to you. An even cheaper option is to go for speedy shipment of equipment–the critical pieces of your business–to you in the event your system fails.

In the end, disaster recovery and business continuity planning is a rational way of ensuring that your business can recover from any disaster. What you need is to keep it on the priority list, plan out a strategy meticulously, test and update it regularly, and carefully document and audit the results. An ‘It can’t happen to me’ attitude can prove to be very expensive in the long run.

Pragya Madan
