Architecting reliable backup and recovery strategy


I have been managing multiple databases, mostly in Microsoft SQL Server and MySQL server, both on on-premise and cloud. We have faced a lot of challenging issues such as records deleted from a user table, backup file is corrupted, backup file is not compatible, backup files got deleted, backup storage is full and backup is running for long time, etc. When you are facing this issues for the first time, it is surprising to see new kind of issues every day and if you are not good in documentation, repetitive issues will keep occurring and we keep fixing rather than suctioning. If you are facing same challenges, then you need to focus on your backup and recovery strategy.

A well-designed backup and recovery strategy maximizes data availability and minimizes data loss without tolerating business requirement. In this post, we will discuss about the following topics:
  1. Recovery Time Objective (RTO)
  2. Recovery Point Objective (RPO)
  3. Business Strategy
  4. Backup 
    • Backup Type
    • Backup Frequency
    • Backup Device
    • Retention and Archival plans
  5. Restore
    • Backup Availability
    • Restoration Access
    • Documentation
  6. Testing
Recovery Time Objective (RTO): Time taken to recover the data

Recovery Time Objective defines, how long would an interruption to data access need to be, before the business ceased to be viable / capable of working successfully?

Best Practice:

Backup and Restoration plan that involves quick recovery with small data loss might be acceptable to business than a plan that reduces data loss but takes much longer to implement.
Deciding optimal RTO might involve following things:
  •  Identifying the correct backup media
  •  Identifying the authorized persons to perform restore
  •  Identifying documentation related to the restore
Example:

RTO for the Stock Exchange System might be 1 minutes – which means, in case of any failure on database system, backup should be restored and system should be online within a minute.

Recovery Point Objective (RPO): The amount of data loss

Recovery Point Objective defines how much data loss will be acceptable to business.

Best Practice:

Backup plan should be able to recover all committed data without any loss and it should be acceptable to loss the data which is in process at the time of failure.

Example:

RPO for the banking system might be 1 seconds – which means, in case of any failure, the bank can loss at most 1 second worth of data.

Business Strategy:

Backup and recovery plan should be in sync with business requirement. As a best practice, it should be communicated with business stakeholders and expectations of the business users should be managed in line with the agreed strategy.

Each database must be categorized based on the importance to the core functions of the organization and RPO and RTO for each database must be documented well in advance and communicated to all stakeholders.

Example:

In certain business, Sales database is very critical than Marketing database and needs to restored in high priority, in case of failure. The RPO and RTO for Sales and Marketing databases will be defined accordingly.

The business requirements will determine all aspects of the backup strategy, including:
  •          How frequently backups need to occur?
  •          How much data is to be backed up each time?
  •          The type of media, that the backups will be held on?
  •          Retention and Archival plans for the media? 
Backup Type:

Backup and recovery plan should clearly define the backup type required for different scenarios. A full backup might be appropriate in certain cases and partial backup will be appropriate in other scenario.

For example, Product database may have huge number of records and backup and restoration of a huge backup is time consuming. When you have RTO of 1 minutes, it is not right strategy to have a full backup which needs 1 hour of restoration time. Similarly, when RPO is defined as 5 seconds, we cannot run full backup on every 5 seconds.

Backup and Recovery strategy should address this question. It should choose the combination of different backup type to achieve the defined RPO and RTO metrics.
Generally, there are 3 types of backups are used:
  •  Full backup.
  •  Differential backup. 
  •  Log backup.
Backup Frequency:

Most of the places, the backups are implemented in below frequency:
  •  Full backup – every week
  •  Differential backup – every day
  •  Log backup – every hour

Review the above schedule and think, will it satisfy the business requirement. There is no right or wrong questions, the above frequency might satisfy the RPO of 1 hours and depends on the size of the database and where backup stored, RTO might be achievable. There is no one size fit all approach here, you have to understand the business requirement and define the RTO and RPO.

When business demand no data loss, log backup of every 1 hour will not help, in case any failure, with the above backup plan, you may lose the data worth 1 hour. A well-designed backup and restore plan will address this issue and it should be documented and shared across business stakeholders.





Backup Device:

This is one of the important factors, choosing the wrong backup device, will negatively impact the RTO and business requirement. Generally, backup device can be either one or combination of below devices:
  •  Disk Files – (SAN, NAS, etc.)
  •  Cloud Storage (Azure Blob Storage, Amazon Cloud Storage, etc.)
Best Practice:

Make sure, the backup device is easily accessible and secured. There are instances, SAN administrator mistakenly deleted the backup files while freeing disk spaces and other instances, where DBA does not have access to backup storage device and the person who is having access is on vacation. If you are in Enterprise setup, usually different things are managed by multiple persons, if you are in small or start-up setup, you might have access to everything. Plan your backup and recovery strategy accordingly, where to write the backup and how you can access the backup, when there is a need.

Also, there are compliance policy, where it demands the backup must be secured, might be using encryption algorithms.

Archival and Retention Policy:

Archival and Retention policy should be defined considering the legal requirements and compliance requirement. Depends upon your region compliance requirements may impact how you store and how long you retain the backup.

Restore:

Backup Availability:

There are scenarios, where backup file got corrupted or mistakenly deleted from storage, if you have only one backup copy, it will lead to loss of data. Backup plan should consider this and implement, more than one copy of backup in an accessible location.

Best Practice:

Generally, it is good practice to have more than 2 copies of backup. Keep 1 one copy in the local machine or attached drive, other copy in SAN/NAS storage and another copy in remote location, such as cloud or other geo.

Keeping a copy in local help in reduce the time needed to transfer the file to local machine and it will help in achieving RTO.






Restoration Access:

In a large enterprise setup, everyone may not have access to restore or recover data. These permissions include, access to backup storage, access to restoration and security access, if encryptions are in place. It is advisable to capture all these permissions in Backup and Recovery strategy and share it with business and technical stakeholders.

In case of emergency, everyone knows their role and understand the sequence, when to execute their job. It will remove the confusions and keep the team in calm environment.

Documentation:

I have interviewed a lot of people for the database administration role and I realized, the documentation is least important task in their day to day job. For a DBA, most of the tasks are repetitive and we are using a lot of scripts and automations. Even something developed or written by us few days back will be very difficult to understand later point of time. When you are in emergency, searching on Google or using others scripts will not help you and it seems it is simple to recover the data but it is not. It is mandatory to document each and every steps and business requirements in clear manner and communicate with all stakeholders.

Also, it is important to agree both the business and technical stakeholders on how quickly data can be retrieved and how much data can be lost and get sign-off on the documentation.

Testing:

As John Ruskin said, “Quality is never an accident; it is always the result of intelligent effort.” It does not matter, how perfectly you designed your backup and recovery strategy, unless you tested many number of times and documented all the challenges and resolution, it is going to be surprise at the time of emergency.

As a best practice, restore the backup on testing and staging frequently in a defined interval and restore on the production machine in longer intervals. There are instances, a backup copy may work in one machine with same database engine and configuration, may not work in other machine with same engine and configuration. Make a habit of test, test and test frequently in a defined period.

Summary:

A well-designed Backup and Recovery strategy will ensure the data availability and minimize the data lose. A documented strategy will act as guiding light in terms of emergency. It will remove the confusions and dependency on individuals. A good tested and orchestrated strategy ensures the business continuity and give guarantee to your sleep J.

I hope this helps in planning your strategy, please share your thoughts on this article and let me know, if I need to edit or add content on this article. Thanks you for your time. Be proactive rather than being reactive!!!