Introduction
Disaster Recovery Planning is the process of preparing for recovery or continuation of IT processing tasks that support critical business processes in the event of a threat to your IT infrastructure. In some cases, IT infrastructure would be recovered in a process that could take days (or weeks), while in other cases processing will continue immediately (or within minutes) at a remote site away from the threat.
The Disaster Recovery (DR) planning and testing process is not generally regarded by IT teams as the most exciting task to be involved in, and most would prefer to keep busy with ‘cooler’ projects such as virtualization or some new Web 2.0 technology. But business continuity and disaster recovery planning are critical for an organization, and when the worst actually happens, there is always plenty of excitement to go around!
As the world has virtually shrunk to become a global village and business opening and closing times have been replaced with round-the-clock operations, the importance of being prepared to keep a business running in the event of a disruptive situation has become a more visible priority.
The threats to an organization, whether from the increase in political uncertainty on a global scale, decreased stability of national power networks, or the changing climate conditions and related severe weather, have seemingly been increasing over the past decade. Further, new threats are continually looming on the horizon, such as the outbreak of highly contagious diseases, digital blackmail and hacking, and new methods used by terrorists for wide-scale destruction. And in addition to these, there are of course internal threats, whether damage caused accidentally through human error or purposeful damage to data by an employee.
As a result, business continuity and disaster recovery have become more widely known terms, and more people in IT are finding that they need to do their part to ensure that the business can continue when something goes wrong.
What's in a name?
In some countries, Business Continuity is not a widely used term and Disaster Recovery is used to refer to recovery of the business as a whole, not just the IT infrastructure. However, I prefer to refer to Business Continuity as the planning for keeping the business as a whole running and Disaster Recovery as a subset of Business Continuity, referring to the task of keeping IT infrastructure available or recovering the IT infrastructure required for critical business processes.
Disaster Recovery is not, however, the most descriptive term for what DR planning and implementation actually involve. The name suggests recovery of a resource after a major incident such as a flood or fire. In reality, though, many incidents that disrupt IT infrastructure (and therefore the business) are relatively minor events such as corruption of data, the accidental deletion of a file, or a hardware fault. As a result, a DR Plan should provide guidance for both a major disaster and a minor disruption.
Business Continuity Planning focuses on the organization as a whole, and a good Business Continuity Plan (BCP) should refer to all aspects of the business, including people, premises, facilities and IT infrastructure. A BCP needs to cover any significant risk to the organization, from events such as loss of a branch due to a fire or loss of a key staff member, through to contamination of the business's products and the resulting damage to the organization's reputation, or a national outbreak of a highly contagious disease that affects large parts of the employee base.
The IT department should not create the BCP (although even today many businesses think of business continuity and disaster recovery planning as an IT-only function). Instead, IT should take the objectives specified by the business in the BCP and create a DR Plan that aims to meet those objectives.
First Steps - RTO, RPO and Strategies
The two primary objectives that need to be determined for each business process (such as taking an order from a customer or running payroll) are the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). Once these objectives are determined for the business processes, they must be mapped to the underlying IT systems and infrastructure that support those processes.
Recovery Point Objective (RPO)
The Recovery Point Objective (RPO) is a measure of how much data can be lost when a disaster occurs. This is effectively the difference in time between when the disaster happens and when the last backup occurred.
If a disaster occurs at 4pm and the last backup took place at 8pm the night before, then all transactions that took place in those 20 hours will be lost. Depending on the system and the organization, these transactions will most likely need to be recaptured and the business needs to plan for how the information required to recapture those transactions will be determined.
The amount of time that would be required to recapture those transactions must also be considered.
In the case of an email server, the business may determine that losing up to 24 hours' worth of emails is acceptable, and may accept that the lost emails will never be recovered.
For a transaction system that captures a high volume of customer orders, the business may decide that it would not be possible to find the information required to recapture those orders and that it is critical that no customer order be missed and therefore that no data can be lost. This will effectively set an objective of zero loss for the RPO.
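The arithmetic behind the 4pm/8pm example can be sketched in a few lines of Python. This is only an illustration: the function names and the calendar dates are hypothetical, and it assumes the last backup is the only recovery point available.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Return the span of transactions lost if we restore from last_backup."""
    if disaster < last_backup:
        raise ValueError("disaster time precedes the last backup")
    return disaster - last_backup


def meets_rpo(last_backup: datetime, disaster: datetime, rpo: timedelta) -> bool:
    """True if the data lost stays within the agreed RPO."""
    return data_loss_window(last_backup, disaster) <= rpo


# Hypothetical dates: backup at 8pm, disaster at 4pm the following day.
last_backup = datetime(2024, 3, 4, 20, 0)
disaster = datetime(2024, 3, 5, 16, 0)

window = data_loss_window(last_backup, disaster)
print(window)  # 20:00:00, i.e. 20 hours of transactions

# Email server with a 24-hour RPO: a nightly backup is good enough.
print(meets_rpo(last_backup, disaster, timedelta(hours=24)))  # True

# Order system with a zero RPO: no backup schedule alone can satisfy it.
print(meets_rpo(last_backup, disaster, timedelta(0)))  # False
```

The last check shows why a zero RPO tends to push a system toward replication rather than periodic backups: any gap between backup and disaster exceeds the objective.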
Recovery Time Objective (RTO)
Essentially, the Recovery Time Objective (RTO) is a measure of how long the business can survive without the systems that run a specific business process. This may range from zero (i.e., the underlying systems always need to be available) to days or weeks (if there are sufficient manual processes in place for the process to continue for that length of time without the systems).
While some processes may only be run at certain times (such as a payroll system generating payslips and making salary payments), these systems may still have a very low RTO, since a disaster can occur at any point in time. For example, if the payroll system became unavailable just after payday, the system might not be required again for close to 30 days. However, if a disaster occurred the day before the monthly salary run, then the system would need to be up within 24 hours.
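The payroll reasoning can be made concrete with a small Python sketch. The function name and the dates are hypothetical; the point is only that the RTO has to be set from the worst-case timing of a disaster, not the most convenient one.

```python
from datetime import datetime, timedelta


def time_available_for_recovery(disaster: datetime, next_run: datetime) -> timedelta:
    """Time between the disaster and the next moment the system is needed."""
    return max(next_run - disaster, timedelta(0))


# Hypothetical monthly salary run on the 25th.
next_run = datetime(2024, 4, 25)

just_after_payday = time_available_for_recovery(datetime(2024, 3, 26), next_run)
day_before_run = time_available_for_recovery(datetime(2024, 4, 24), next_run)

print(just_after_payday.days)  # 30
print(day_before_run.days)     # 1

# The RTO must cover the worst case, so it follows the smaller window.
required_rto = min(just_after_payday, day_before_run)
print(required_rto)  # 1 day, 0:00:00
```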
Strategies
Once you have determined the RPO and RTO for all systems, you need to use that information to select a strategy for recovery.
A common way to look at the primary strategies for disaster recovery is based on definitions of the primary off-site recovery centre facilities that these strategies use.
A cold site provides only the most basic infrastructure at the recovery site, with no actual systems. For example, the cold site may have air-conditioning, computer cabling, raised floor, etc - but there will be no systems permanently available at that site. When a disaster occurs, new systems need to be delivered to the site and only then can recovery start. Obviously the time to recover will be relatively long (days or weeks - depending on how long it takes for backup systems to be delivered to the site) and therefore this strategy is seldom used and is not recommended.
A warm site is an off-site recovery facility that contains data center facilities (backup power supply, cabling, air conditioning, etc) and permanently available IT infrastructure (servers, disk, networking, etc). With warm site recovery there may be some data mirroring or replication, although most restores will be done from a backup. Recovery would generally take hours or days, and this is a common strategy.
A hot site recovery facility contains dedicated hardware that can be ready to take over production system processing immediately, within minutes or within a few hours at most. With hot site recovery, the data required to continue operations is generally replicated to the recovery site and so is available virtually immediately. A hot site recovery strategy is fairly expensive but is the only acceptable strategy for a very low (or near zero) RTO and RPO.
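As a rough illustration of how the RPO and RTO feed into the choice of site, the Python sketch below maps the tighter of the two objectives to a strategy. The cut-off values (4 hours and 3 days) are assumptions picked to match the "few hours" and "hours or days" ranges described above, not fixed rules, and the names are hypothetical; each organization would set its own thresholds and weigh cost as well.

```python
from datetime import timedelta
from enum import Enum


class Strategy(Enum):
    HOT = "hot site"
    WARM = "warm site"
    COLD = "cold site"


# Assumed thresholds for illustration only.
HOT_LIMIT = timedelta(hours=4)   # anything tighter needs replicated data and standby hardware
WARM_LIMIT = timedelta(days=3)   # anything tighter needs systems already on site


def select_strategy(rto: timedelta, rpo: timedelta) -> Strategy:
    """Pick the least demanding site type that can still meet both objectives."""
    tightest = min(rto, rpo)  # simplification: the stricter objective drives the choice
    if tightest <= HOT_LIMIT:
        return Strategy.HOT
    if tightest <= WARM_LIMIT:
        return Strategy.WARM
    return Strategy.COLD


# Order capture system: no downtime and no data loss tolerated.
print(select_strategy(timedelta(0), timedelta(0)).value)                 # hot site

# Email server: a day of downtime and a day of lost mail are acceptable.
print(select_strategy(timedelta(hours=24), timedelta(hours=24)).value)   # warm site

# Archive system: weeks of downtime are tolerable.
print(select_strategy(timedelta(weeks=2), timedelta(days=7)).value)      # cold site
```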
Disaster Recovery Documentation
Once you have determined your recovery strategies, you need to start developing your DR Plan and Procedures. The two primary sets of documentation will be the master plan and the detailed recovery procedures.
The master plan should not be a technical document but will cover things such as declaring a disaster, notifying the disaster recovery team members (including a list of full contact details for the team), information on the recovery site, an overview of the priority and order for recovery, etc.
The detailed recovery procedures will be the technical procedures required to recover your systems. This will include failing over to the backup system (in the case of a high availability system) or how to build the operating system and restore applications, databases and data.
Technical recovery procedures should be written in a way that they are detailed enough for another technical resource (with the relevant technical experience) to perform the recovery, without presuming that the technical resource has any knowledge of your specific production systems or your recovery testing.
Disaster Recovery Testing
Disaster Recovery Testing is a critical and on-going part of Disaster Recovery Planning. For a DR Plan to be one that you can trust and rely on when things go bad, you need to test it on a regular basis (generally all systems should be tested at least once every 6 months).
During DR Testing you will need to use your recovery plan and technical procedures to recover your systems (or fail over to backup systems in the event of a high availability strategy). Your first DR tests may not meet all your objectives as far as getting all applications and data fully recovered and functional; however, they will provide an opportunity to revise your documentation and to resolve any issues discovered with your backups or high availability system.
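Keeping track of the six-month test cycle is easy to automate. The Python sketch below is one possible way to flag systems that are overdue for a DR test; the system names, dates, and the choice of 183 days as "six months" are all assumptions made for the example.

```python
from datetime import date, timedelta

# Roughly six months; an assumed approximation of the interval suggested above.
TEST_INTERVAL = timedelta(days=183)


def overdue_for_testing(last_tested: dict[str, date], today: date) -> list[str]:
    """Return the names of systems whose last DR test is older than TEST_INTERVAL."""
    return sorted(
        name for name, tested in last_tested.items()
        if today - tested > TEST_INTERVAL
    )


# Hypothetical test log.
last_tested = {
    "email": date(2024, 3, 1),
    "orders": date(2023, 11, 15),
    "payroll": date(2024, 6, 10),
}

print(overdue_for_testing(last_tested, today=date(2024, 7, 1)))  # ['orders']
```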
Conclusion
Preparing for Disaster Recovery is an on-going task and it's not easy. You need to do the groundwork to determine the objectives that your business requires, select an appropriate strategy, create useful and accurate documentation, and then regularly test, refine and improve your plan.
For additional help in developing a Disaster Recovery Plan, refer to the tutorials at www.disaster-recovery-guidance.com.




