Consul disaster recovery considerations
Note
This guide applies to Consul versions 1.8 and above.
Disaster recovery considerations are an important part of a company’s overall business continuity planning. Both Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) should be considered. With Consul, there are several important considerations to keep in mind when planning for disaster recovery.
Note
Consul snapshots contain extremely sensitive data (e.g. credentials in recoverable form) and therefore should only be stored on an encrypted medium with sufficiently strict access controls in place.
The recommended method of backing up Consul state is to use the built-in Consul snapshot command accessed via the API or CLI. This will ensure a consistent and atomic point-in-time capture of the Consul cluster state. The next tutorial, Backup Consul Data and State covers this in more detail.
For Consul clusters deployed to HashiCorp Cloud Platform (HCP), the recommended method of backing up Consul's state is to use the snapshots functionality in the related cluster section in the HCP Consul Dedicated UI. Taking snapshots using the Consul agent or using the automated backups feature of Consul Enterprise agents is also possible.
Backup Storage
Although block or file-based disk backups (e.g. AWS Volume snapshots, Azure Backup) can be an effective way of backing up other services, they cannot reliably restore a Consul cluster to its original state.
Snapshots should be stored on mounted or external storage, instead of local or ephemeral storage.
Backup Management
Snapshots should be taken on a regular basis. They can be automated through the use of scripting, 3rd party tools, or the automated backups feature of Consul Enterprise.
How frequently snapshots are taken should align with the RTO and RPO set forth within each customer’s disaster recovery policies.
Note
Consul Enterprise offers the native ability to automatically back up Consul state to cloud storage (e.g. AWS S3, Azure Blob, etc.), via automated backups.
- Retention policies should be configured on mounted or external storage to align with the customer’s data retention and compliance policies.
Restore
Customers should regularly test and validate the restore process for critical systems to ensure that everything works as expected. This is typically defined within a Disaster Recovery Plan (DRP).
In the event of a loss of quorum in the primary Consul cluster or a complete loss of the primary Consul cluster, to restore you will need to build a new Consul cluster from the latest snapshot. Consul client agents may also need to be re-installed/reconfigured. Automation technologies can greatly reduce the amount of time that it will take to perform these steps. Keep this in mind when considering RPO and RTO.
When you restore a snapshot to a new Consul cluster, Note the following behavior:
If token persistence was enabled on client agents prior to performing a snapshot restore, then the client agents will continue to function after the restore.
If the ACL token is specified directly in the client agent configuration prior to performing a snapshot restore, then the client agents will continue to function after the restore.
If neither of the above options were configured prior to a snapshot restore, then client agents will not function after the restore as they will not have permissions to register with the new Consul cluster.
- This would be relevant if the ACL token was set via the API/CLI or if the ACL token was set via an environment variable.
Token persistence enabled | ACL token provided in Consul client config | Consul client requires a re-configuration |
---|---|---|
Yes | No | No |
No | Yes | No |
Yes | Yes | No |
No | No | Yes |
Region failure
In the event of a total region failure, Consul and your services are likely down. To architect against this situation, deploy Consul and your services in multiple regions with a global failover policy that reroutes network traffic to the alternate region during a disaster. Deploying identical Consul servers and services across multiple cloud regions satisfies datacenter latency requirements and limits the blast radius during large-scale disasters.
Service failure
To architect against outages caused by disasters that impact services registered with Consul, use cluster peering failover with sameness groups. With this setup, Consul can transparently failover requests to an unhealthy service to the same service in a different region and datacenter.
Glossary
Recovery point objective (RPO)
A recovery point objective is defined as the maximum amount of data loss that can be incurred from a disaster, failure, or comparable event. This is measured as a unit of time and there is usually a 1-to-1 correlation between RPO and backup frequency.
Recovery time objective (RTO)
A recovery time objective is defined as the amount of time that passes between application failure and full availability which includes how much time it takes to recover from the disaster or failure. This could be relatively short for a customer that already has another datacenter location available for disaster recovery purposes and replication of services and data occurs on a regular basis. Additionally, customers that leverage automation technologies may be able to recover more quickly from a disaster.
Disaster recovery plan (DRP)
A formal document created by an organization that contains the processes used to recover access to systems and data after a catastrophic event. DRPs will typically also include a set of processes for testing and validating disaster recovery procedures and establish a regular cadence for these events.