Disaster Recovery Planning: Ensuring Business Continuity in the Face of Disruptions
Any discussion about recovery should start with the question of what a disaster is. Another critical question is what threatens your IT environment. This article addresses both questions and outlines how to build a plan for handling such events.
Disaster
A disaster is any event that interrupts, or could interrupt, the whole system. Sometimes it is the failure of one critical machine; at other times it is a network problem or a power loss.
The critical part of researching disaster events is collecting as many cases as possible. Only then can we find our critical vulnerabilities and a way to cover them. Of course, predicting every situation is impossible, but this research can reveal a pattern behind most events.
Disaster and Business
When you run a business, vulnerabilities in your infrastructure can critically affect your profits or your entire operation. Business Impact Analysis (BIA) is a process that provides information about the challenges facing your infrastructure and your business. Its result is a vital document that you need to create, because it will be the basis for all recovery and disaster management plans.
Another critical aspect of modern business is documentation. Gather as much information as possible so that you can compare your infrastructure with your BIA. This comparison can answer questions about how to manage problems with business continuity. This leads us to the next topic.
Disaster and Infrastructure
In a typical environment, a few areas deserve your attention:
- Compute:
◦ Physical hardware – all systems that run on hardware such as servers or workstations;
◦ Virtual environment – all environments that use virtual machines or containers;
- Storage – hardware and software systems where you can store files or objects;
- Services – all services in use: databases, applications, or communication;
- Network – all hardware and architecture used for communication between components and with the external world.
As noted above, we should cover as many issues as possible, and our Disaster Recovery Plan should set out the available options for recovery. There are a few ways to do it:
- Backup and restore – applies to objects that hold data (machine volumes, storage, databases, binaries, documents)
- Redundancy and High Availability – for all parts of the infrastructure that keep data circulating (power supply, network medium)
- Documentation – authorization tokens, infrastructure schemas, and any knowledge that keeps your business in good shape
- Workaround – any solution that does not fix the problem but keeps everything running
Documentation is itself a way of preserving information, so it also needs to be backed up.
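The link between infrastructure areas and recovery methods can be sketched as a small inventory. This is only an illustration under our own assumptions: the names Area, Method, Component, and suggest_methods, and the two flags holds_data and carries_data, are hypothetical and do not come from any particular tool. Workarounds are left out because they depend on the specific failure.

```python
from dataclasses import dataclass
from enum import Enum


class Area(Enum):
    COMPUTE = "compute"
    STORAGE = "storage"
    SERVICES = "services"
    NETWORK = "network"


class Method(Enum):
    BACKUP_RESTORE = "backup and restore"
    REDUNDANCY_HA = "redundancy and high availability"
    DOCUMENTATION = "documentation"


@dataclass(frozen=True)
class Component:
    name: str
    area: Area
    holds_data: bool    # assumption: volumes, databases, binaries, documents
    carries_data: bool  # assumption: power supply, network links


def suggest_methods(component: Component) -> list:
    """Return candidate recovery methods for one inventory entry."""
    methods = []
    if component.holds_data:
        methods.append(Method.BACKUP_RESTORE)
    if component.carries_data:
        methods.append(Method.REDUNDANCY_HA)
    # Assumption: every component should at least be documented.
    methods.append(Method.DOCUMENTATION)
    return methods


if __name__ == "__main__":
    inventory = [
        Component("orders-db", Area.SERVICES, holds_data=True, carries_data=False),
        Component("core-switch", Area.NETWORK, holds_data=False, carries_data=True),
    ]
    for item in inventory:
        names = ", ".join(m.value for m in suggest_methods(item))
        print(f"{item.name}: {names}")
```

Keeping the inventory as data makes it easy to compare against the BIA and to spot components that have no recovery method assigned.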
Disaster and Recovery
How do we compile all of this into one plan? The information you gather and sort by the rules above lets you assign every part of the environment to the correct recovery category. Note that the category a given piece of infrastructure belongs to depends on individual needs and priorities. We could divide everything as follows:
- Critical and urgent – all components and procedures whose failure stops your enterprise immediately. The best way to cover them is redundant infrastructure or a High Availability setup (for example, more than one power line, more than one network connection, colocation for servers, or fail-over clusters)
- Critical non-urgent – all components that could severely affect your environment but have a workaround or are not often used. You should have a plan to restore them from backups or recreate them from documentation (for example, an old version of a VM, a power supply with UPS)
- Non-critical and urgent – all components that need your attention but whose failure still lets you run your enterprise, only less efficiently. You should have a procedure for each of these scenarios. The plan could apply a workaround first and then restore from a backup (for example, failure of one disk in RAID 5 or 6, failure of one of the APs in a mesh network)
- Non-critical and non-urgent – all components that have no or minimal impact on your business. You should monitor them and create a process for improving the situation, but only once all of the above is resolved (for example, a physical indicator on a network component, a network device without PoE)
Once we have categorized all events and written procedures for them, we can compose them into a Disaster Recovery Plan.
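The four categories can be sketched in code as a simple classification that also orders events for the plan. This is a minimal sketch under our own assumptions: the names Priority, Event, classify, build_plan, and RESPONSE are hypothetical, and the two boolean flags are a simplification of what is in practice a judgment based on your own needs and priorities.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL_URGENT = 1
    CRITICAL_NON_URGENT = 2
    NON_CRITICAL_URGENT = 3
    NON_CRITICAL_NON_URGENT = 4


# Suggested response per category, paraphrased from the list above.
RESPONSE = {
    Priority.CRITICAL_URGENT: "redundant infrastructure or High Availability setup",
    Priority.CRITICAL_NON_URGENT: "restore from backup or recreate from documentation",
    Priority.NON_CRITICAL_URGENT: "apply a workaround first, then restore from backup",
    Priority.NON_CRITICAL_NON_URGENT: "monitor and improve once everything else is resolved",
}


@dataclass(frozen=True)
class Event:
    name: str
    critical: bool
    urgent: bool


def classify(event: Event) -> Priority:
    """Map the two flags to one of the four categories."""
    if event.critical:
        return Priority.CRITICAL_URGENT if event.urgent else Priority.CRITICAL_NON_URGENT
    return Priority.NON_CRITICAL_URGENT if event.urgent else Priority.NON_CRITICAL_NON_URGENT


def build_plan(events):
    """Return (name, priority, response) rows, most pressing category first."""
    ranked = sorted(events, key=classify)
    return [(e.name, classify(e), RESPONSE[classify(e)]) for e in ranked]


if __name__ == "__main__":
    # Hypothetical events based on the examples in the list above.
    events = [
        Event("port LED broken", critical=False, urgent=False),
        Event("one RAID 6 disk failed", critical=False, urgent=True),
        Event("only power line down", critical=True, urgent=True),
        Event("old VM version lost", critical=True, urgent=False),
    ]
    for name, priority, response in build_plan(events):
        print(f"{priority.name}: {name} -> {response}")
```

Sorting by category gives a first draft of the order in which procedures appear in the plan; the responses themselves still have to be written as full procedures.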
Disaster and Practice
It is good to have the Disaster Recovery Plan written down, but we should always test it as well. This is a long process:
- create a test environment
- check every procedure in a test environment
- simulate a full disaster in a test environment
- gather feedback and improve the plan, then test it again
- create a plan for periodically testing your DRP
The essential thing is to gather feedback the whole time and improve the plan continuously; the environment can change, and our plans should change too.
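Checking a single procedure in a test environment can be as simple as a restore drill: back up an object, restore it somewhere else, and confirm the result matches the original. The sketch below assumes a file-based backup and uses plain file copies as stand-ins for the real backup and restore jobs; the function names and the sample file are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, backup_dir: Path, restore_dir: Path) -> bool:
    """Back up, restore, and compare checksums of the source and the restored file."""
    backup_copy = backup_dir / source.name
    shutil.copy2(source, backup_copy)    # stand-in for the real backup job
    restored = restore_dir / source.name
    shutil.copy2(backup_copy, restored)  # stand-in for the real restore procedure
    return sha256_of(source) == sha256_of(restored)


def run_drill() -> bool:
    """Run the drill inside a throwaway directory that acts as the test environment."""
    with tempfile.TemporaryDirectory() as root:
        base = Path(root)
        backup_dir = base / "backup"
        restore_dir = base / "restore"
        backup_dir.mkdir()
        restore_dir.mkdir()
        source = base / "orders.csv"  # hypothetical sample data
        source.write_text("id,total\n1,19.99\n")
        return restore_matches_source(source, backup_dir, restore_dir)


if __name__ == "__main__":
    print("restore drill passed" if run_drill() else "restore drill FAILED")
```

A drill like this can be scheduled as part of the periodic DRP test, with each failure fed back into the plan.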
Best practices
Finally, a few best practices that you can use:
- Your backup system should be independent of the backed-up infrastructure. For example, the backup destination should be outside of the backup system;
- If possible, use colocation – run your services in multiple places, so that the failure of one does not affect the rest;
- Always create a test environment to test your setup;
- Use backups and snapshots. They allow you to restore objects to an exact state;
- Precision is key.
Conclusion
We hope this short article gives you a first look at how to create a DRP. The most important thing is to research your infrastructure thoroughly and set priorities. Everything else follows from that.
02:08 PM, Feb 02
Author:
IT Systems Specialist
Krzysztof Szawara