Recovery
Recovery is the process of restoring a system service after a disruption.
The disruption could be:
- A catastrophic loss of facilities, such as fire or flood.
- A hardware error, such as a disk failure.
- A software or human error which results in data corruption or loss.
The last of these is very important: a great deal of recovery work is done to reverse the effects of software or human error.
Recovery is closely related to resilience, which aims to prevent systems from failing in the first place. Both recovery and resilience are required. Even a totally resilient system must be able to recover after data loss caused by human error.
The basic concept for recovery is to have an independent copy of the system's programs and data, known as a backup, and to use this copy to recreate the service.
There are two important objectives to consider for recovery:
- How old the recovered data will be, and how much data has been lost. This is known as the recovery point objective (RPO).
- How quickly the service can be restored. This is known as the recovery time objective (RTO).
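As a rough illustration of how the two objectives can be checked against a backup schedule, here is a minimal Python sketch. The function names, the times and the objective values are all assumptions made up for the example, not part of any standard tool.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Work lost if the most recent backup is restored after the failure."""
    return failure - last_backup


def meets_objectives(loss: timedelta, downtime: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the data loss and the downtime are within target."""
    return loss <= rpo and downtime <= rto


# Hypothetical scenario: nightly backup at 02:00, disk failure at 17:30,
# and a restore that takes three hours.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 17, 30)
loss = data_loss_window(last_backup, failure)  # 15 hours 30 minutes
downtime = timedelta(hours=3)

# A 4-hour RPO cannot be met with a daily backup alone; a 24-hour RPO can.
strict = meets_objectives(loss, downtime,
                          rpo=timedelta(hours=4), rto=timedelta(hours=8))
relaxed = meets_objectives(loss, downtime,
                           rpo=timedelta(hours=24), rto=timedelta(hours=8))
```

In this scenario `strict` is false and `relaxed` is true: the backup interval sets a floor on the achievable RPO, independently of how fast the restore is.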
Traditionally, backup has involved making daily copies of data to tape and taking these to another site so that they are safe from catastrophic failure. To make backups quicker, incremental backups are used, which hold only the changes made since the previous backup.
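The idea of an incremental backup can be sketched in a few lines of Python. This is a simplified in-memory model, assuming the data is a mapping from file name to content; the function names are hypothetical and real backup tools typically compare timestamps or checksums rather than full contents.

```python
from typing import Dict, List


def incremental_backup(current: Dict[str, str],
                       previous: Dict[str, str]) -> dict:
    """Record only what changed since the previous backup."""
    changed = {k: v for k, v in current.items() if previous.get(k) != v}
    deleted = sorted(k for k in previous if k not in current)
    return {"changed": changed, "deleted": deleted}


def restore(full: Dict[str, str], increments: List[dict]) -> Dict[str, str]:
    """Rebuild the data from a full backup plus every increment, in order."""
    state = dict(full)
    for inc in increments:
        state.update(inc["changed"])
        for key in inc["deleted"]:
            state.pop(key, None)
    return state


# Sunday: full backup. Monday and Tuesday: incrementals.
sunday = {"a.txt": "v1", "b.txt": "v1"}
monday = {"a.txt": "v2", "b.txt": "v1", "c.txt": "v1"}
tuesday = {"a.txt": "v2", "c.txt": "v1"}

inc_mon = incremental_backup(monday, sunday)
inc_tue = incremental_backup(tuesday, monday)
restored = restore(sunday, [inc_mon, inc_tue])
```

The trade-off is visible in `restore`: the backup itself is small, but recovery needs the full backup and every increment since, applied in order.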
A daily backup risks losing work that has been carried out during the day. This risk can be reduced by using logging or journalling, which captures changes made to the data during the day and allows these to be applied to the restored data. Most modern databases support logging. It is, of course, important to write the logs to a different device from the one holding the main database.
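The following Python sketch shows how a log can be replayed over a restored backup. It is a toy model under stated assumptions: each log entry is a hypothetical `(timestamp, key, new_value)` tuple, and the `until` parameter stops the replay at a chosen point in time. Real database logs are considerably more involved.

```python
from typing import Dict, List, Tuple

# Assumed log format for this example: (timestamp, key, new_value)
LogEntry = Tuple[int, str, str]


def recover(backup: Dict[str, str], log: List[LogEntry],
            until: int) -> Dict[str, str]:
    """Restore the backup, then re-apply logged changes up to `until`."""
    state = dict(backup)
    for timestamp, key, value in sorted(log):
        if timestamp > until:
            break  # stop before changes made after the chosen point in time
        state[key] = value
    return state


# Overnight backup, followed by three changes during the day.
backup = {"balance": "100"}
log = [
    (1, "balance", "120"),
    (2, "balance", "90"),
    (3, "balance", "0"),  # an erroneous update we want to leave out
]

latest = recover(backup, log, until=3)         # everything in the log
before_error = recover(backup, log, until=2)   # stops before the mistake
```

Replaying the whole log recovers the day's work after a hardware failure, while stopping at an earlier timestamp recovers to the point just before a software or human error.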
As well as regular backups to tape, many systems now use resilient data storage, which greatly reduces the risk of data loss and makes recovery of service much simpler. Resilient data storage maintains additional copies of the data on different disks, and in some cases in different locations. Recovery to a point in time is still required, even with resilient data storage, to recover after software or human error.
To be worthwhile, system recovery plans must match business recovery plans. If there has been a catastrophic loss of facilities, there is little point in restoring systems if the users of the systems have nowhere to work from.
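A small simulation can make the last point concrete. The `MirroredStore` class below is a hypothetical stand-in for resilient storage, written only for this illustration: it survives a disk failure, but a mistaken delete is faithfully applied to every copy, so only an independent backup can bring the data back.

```python
import copy


class MirroredStore:
    """Toy model of resilient storage: every write goes to all live replicas."""

    def __init__(self, copies=2):
        self.replicas = [{} for _ in range(copies)]

    def _live(self):
        return [r for r in self.replicas if r is not None]

    def put(self, key, value):
        for replica in self._live():
            replica[key] = value

    def delete(self, key):
        for replica in self._live():
            replica.pop(key, None)

    def fail_disk(self, index):
        self.replicas[index] = None  # simulate losing one disk

    def get(self, key):
        live = self._live()
        if not live:
            raise RuntimeError("all replicas lost")
        return live[0].get(key)

    def snapshot(self):
        """Independent point-in-time copy, i.e. a backup."""
        return copy.deepcopy(self._live()[0])


store = MirroredStore()
store.put("orders", "1,2,3")
backup = store.snapshot()

store.fail_disk(0)
after_disk_failure = store.get("orders")  # still available from the mirror

store.delete("orders")                    # human error hits every replica
after_mistake = store.get("orders")       # the data is gone from the store
recovered = backup["orders"]              # only the backup still has it
```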
Recovery is often error-prone and needs to be tested when it is first designed and regularly thereafter. Assume that any recovery procedure that is not regularly tested will not work.
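One simple way to automate part of a recovery test is to record checksums when the backup is taken and compare them after a trial restore. The sketch below assumes the data is a directory tree of files; `make_manifest` and `verify_restore` are names invented for this example. It checks only that the files came back intact, not that the restored service actually works.

```python
import hashlib
from pathlib import Path


def make_manifest(root: Path) -> dict:
    """Map each file's relative path to its SHA-256 checksum."""
    return {
        path.relative_to(root).as_posix():
            hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_restore(restored_root: Path, manifest: dict) -> list:
    """Return the paths that are missing, extra, or differ after a restore."""
    actual = make_manifest(restored_root)
    return sorted(
        path for path in manifest.keys() | actual.keys()
        if manifest.get(path) != actual.get(path)
    )
```

A scheduled job could call `make_manifest` at backup time, restore the backup into a scratch directory, and raise an alert whenever `verify_restore` returns a non-empty list.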
Management tips
- Refer to this as "recovery" or "continuity", not "backup", to focus attention on the correct issues.
- There are many vendors of data backup and resilience solutions. Be careful to address the basics of planning and testing, and do not get diverted by backup and resilience technology.
- Consider recovery together with resilience, as they are strongly complementary.
See
Resilience
Wikipedia article on Backup
