Run Consul-Terraform-Sync with high availability

An enterprise license is only required for enterprise distributions of Consul-Terraform-Sync (CTS).

This topic describes how to run Consul-Terraform-Sync (CTS) configured for high availability. High availability is an enterprise capability that ensures that all changes to Consul that occur during a failover transition are processed and that CTS continues to operate as expected.

Introduction

A CTS cluster always has exactly one instance that is the designated leader. The leader is responsible for monitoring and running tasks. If the leader fails, CTS triggers the following process when it is configured for high availability:

The CTS cluster promotes a new leader from the pool of followers in the network.
The new leader begins running all existing tasks in once-mode in order to process changes that occurred during the failover transition period. In this mode, CTS runs all existing tasks one time.
The new leader logs any errors that occur during once-mode operation and then continues to monitor Consul for changes.

In a standard configuration, CTS exits if errors occur when the CTS instance runs tasks in once-mode. In a high availability configuration, CTS logs the errors and continues to operate without interruption.

The following diagram shows the operating state when high availability is enabled. CTS Instance A is the current leader and is responsible for monitoring and running tasks:

Consul-Terraform-Sync architecture configured for high availability before a shutdown event

The following diagram shows the CTS cluster state after the leader stops. CTS Instance B becomes the leader responsible for monitoring and running tasks.

Consul-Terraform-Sync architecture configured for high availability after a shutdown event

Failover details

The time it takes for a new leader to be elected is determined by the high_availability.cluster.storage.session_ttl configuration. The minimum failover time is equal to the session_ttl value. The maximum failover time is double the session_ttl value.
If failover occurs during task execution, a new leader is elected. The new leader will attempt to run all tasks once before continuing to monitor for changes.
If using the Terraform Cloud (TFC) driver, the task finishes and CTS starts a new leader that attempts to queue a run for each task in TFC in once-mode.
If using the Terraform driver, the task may complete depending on the cause of the failover. The new leader starts and attempts to run each task in once-mode. Depending on the module and provider, the task may require manual intervention to fix any inconsistencies between the infrastructure and Terraform state.
If failover occurs when no task is executing, CTS elects a new leader that attempts to run all tasks in once-mode.

Note that driver behavior is consistent whether or not CTS is running in high availability mode.
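
The failover window described above is simple arithmetic on the session_ttl value. The following sketch makes the bounds concrete; `failover_window` is a hypothetical helper (not part of CTS) and assumes the TTL is given in whole seconds.

```shell
# failover_window is a hypothetical helper, not part of CTS.
# It prints the minimum and maximum failover time for a session_ttl
# given in whole seconds: minimum = session_ttl, maximum = 2 x session_ttl.
failover_window() {
  ttl="$1"
  echo "min=${ttl}s max=$((ttl * 2))s"
}

# With the session_ttl of 30s used in the example configuration:
failover_window 30
```

With a 30 second session_ttl, a new leader should be elected between 30 and 60 seconds after the previous leader stops.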

Requirements

Verify that you have met the basic requirements for running CTS.

CTS Enterprise 0.7 or later
Terraform CLI 0.13 or later
All instances in a cluster must be in the same datacenter.

You must configure appropriate ACL permissions for your cluster. Refer to ACL permissions for details.

We recommend specifying the TFC driver in your CTS configuration if you want to run in high availability mode.

Configuration

Add the high_availability block in your CTS configuration and configure the required settings to enable high availability. Refer to the Configuration reference for details about the configuration fields for the high_availability block.

The following example configures high availability functionality for a cluster named cts-cluster:

cts-config.hcl

high_availability {
  cluster {
    name = "cts-cluster"
    storage "consul" {
      parent_path = "cts"
      namespace   = "ns"
      session_ttl = "30s"
    }
  }

  instance {
    address = "cts-01.example.com"
  }
}

ACL permissions

The session and keys resources in your Consul environment must have write permissions. Refer to the ACL documentation for details on how to define ACL policies.

If the high_availability.cluster.storage.namespace field is configured, then your ACL policy must also enable write permissions for the namespace resource.
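
As a rough illustration, a policy covering these permissions for the example configuration might look like the following sketch. It is not an official policy: the file name `cts-ha-policy.hcl` is hypothetical, the `key_prefix` value assumes that CTS stores its data beneath the configured parent_path (`cts`), and the empty `session_prefix` is deliberately broad and should be narrowed for production use. The namespace `ns` matches the example configuration.

```shell
# Write a hypothetical Consul ACL policy file for the example configuration.
# Assumptions (not confirmed by this topic): keys live under "cts/" and the
# policy is defined in the default namespace so it can contain a namespace rule.
cat > cts-ha-policy.hcl <<'EOF'
namespace "ns" {
  # write permission on the namespace resource itself
  policy = "write"

  # write permission on keys beneath the configured parent_path
  key_prefix "cts/" {
    policy = "write"
  }

  # write permission on sessions
  session_prefix "" {
    policy = "write"
  }
}
EOF
```

A file like this could then be loaded with `consul acl policy create -name <policy-name> -rules @cts-ha-policy.hcl` and attached to the token that CTS uses.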

Start a new CTS cluster

We recommend deploying a cluster of three CTS instances so that the cluster has one leader and two followers.

Create an HCL configuration file that contains the settings you need, including the high_availability block. Refer to Configuration Options for Consul-Terraform-Sync for all configuration options.
Issue the startup command and pass the configuration file. Refer to the start command reference for additional information about CTS startup modes.
```
$ consul-terraform-sync start -config-file ha-config.hcl
```
You can call the /status API endpoint to verify the status of tasks CTS is configured to monitor. Only the leader of the cluster will return a successful response. Refer to the /status API reference documentation for information about usage and responses.
```
$ curl localhost:<port>/v1/status/tasks
```

Repeat the procedure to start the remaining instances for your cluster. We recommend using near-identical configurations for all instances in your cluster. You may not be able to use identical configurations in all cases, but starting instances with the same configuration improves consistency and reduces confusion if you need to troubleshoot errors.
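
Once all instances are running, the fact that only the leader answers the task status request successfully can be used to find out which instance currently leads. The following is a minimal sketch, not an official tool: `find_leader` is a hypothetical helper, and the port 8558 and the `/v1` path prefix are assumptions taken from the task API examples in this topic.

```shell
# find_leader is a hypothetical helper, not part of CTS.
# It prints the first address whose task status endpoint returns a
# successful response. Port 8558 and the /v1 prefix are assumptions.
find_leader() {
  for addr in "$@"; do
    if curl --silent --fail --max-time 5 --output /dev/null \
        "http://${addr}:8558/v1/status/tasks"; then
      echo "$addr"
      return 0
    fi
  done
  # no instance answered successfully
  return 1
}
```

For example, `find_leader cts-01.example.com cts-02.example.com cts-03.example.com` would print the address of the leading instance, assuming those hypothetical host names resolve to your three instances.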

Modify an instance configuration

You can implement a rolling update to update a non-task configuration for a CTS instance, such as the Consul connection settings. If you need to update a task in the instance configuration, refer to Modify tasks.

Identify the leader CTS instance by either making a call to the status/cluster API endpoint or by checking the logs for the following entry:
```
[INFO] ha: acquired leadership lock: id=<ID-OF-CTS-INSTANCE>
```
Stop one of the follower CTS instances and apply the new configuration.
Restart the follower instance.
Repeat steps 2 and 3 for other follower instances in your cluster.
Stop the leader instance. One of the follower instances becomes the leader.
Apply the new configuration to the former leader instance and restart it.
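
The ordering in this procedure (followers first, leader last) can be sketched with two small helpers. Both are hypothetical and not part of CTS: `leader_id_from_log` extracts the ID from the leadership log entry shown above, and `rolling_order` prints the order in which to update instances. The sketch assumes you can map the logged ID to the instance names you pass in.

```shell
# Both helpers are hypothetical illustrations, not part of CTS.

# Read CTS log lines on stdin and print the ID from the most recent
# "acquired leadership lock" entry.
leader_id_from_log() {
  sed -n 's/.*ha: acquired leadership lock: id=\(.*\)$/\1/p' | tail -n 1
}

# Print the update order: every follower first, the leader last.
# Usage: rolling_order <leader> <instance>...
rolling_order() {
  leader="$1"
  shift
  for inst in "$@"; do
    if [ "$inst" != "$leader" ]; then
      echo "$inst"
    fi
  done
  echo "$leader"
}
```

Updating the leader last means the cluster goes through only one leadership change during the rollout.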

Modify tasks

When high availability is enabled, CTS persists task and event data. Refer to State storage and persistence for additional information.

You can use the following methods for modifying tasks when high availability is enabled. We recommend choosing a single method to make all task configuration changes because inconsistencies between the state and the configuration can occur when mixing methods.

Delete and recreate the task

We recommend deleting and recreating a task if you need to make a modification. Use the CTS API to identify the CTS leader instance and replace a task.

Identify the leader CTS instance by either making a call to the status/cluster API endpoint or by checking the logs for the following entry:
```
[INFO] ha: acquired leadership lock: id=<ID-OF-CTS-INSTANCE>
```
Send a DELETE call to the /v1/tasks/<task-name> endpoint to delete the task. In the following example, the leader instance is at localhost:8558:
```
$ curl --request DELETE  localhost:8558/v1/tasks/task_a
```
You can also use the task delete command to complete this step.
Send a POST call to the /v1/tasks endpoint and include the updated task in your payload.
```
$ curl --header "Content-Type: application/json" \
--request POST \
--data @payload.json \
localhost:8558/v1/tasks
```
You can also use the task create command to complete this step.
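
The POST request above reads the task definition from payload.json. The following sketch writes a minimal payload for a task named task_a. The field layout reflects our understanding of the CTS task API and should be checked against the API reference; the module path and service names are placeholders, not values from this topic.

```shell
# Write a hypothetical payload.json for the POST request.
# "path/to/module", "web", and "api" are placeholder values.
cat > payload.json <<'EOF'
{
  "task": {
    "name": "task_a",
    "enabled": true,
    "module": "path/to/module",
    "condition": {
      "services": {
        "names": ["web", "api"]
      }
    }
  }
}
EOF
```

Keeping the payload in a file makes it easier to review the updated task before you recreate it.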

Discard data with the `-reset-storage` flag

You can restart the CTS cluster using the -reset-storage flag to discard persisted data if you need to update a task.

Stop a follower instance.
Update the instance’s task configuration.
Restart the instance and include the -reset-storage flag.
Stop all other instances so that the updated instance becomes the leader.
Start all other instances again.
Restart the instance you restarted in step 3 without the -reset-storage flag so that it starts up with the current state. If you continue to run an instance with the -reset-storage flag enabled, then CTS will reset the state data whenever the instance becomes the leader.
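
Steps 3 and 6 differ only in whether the -reset-storage flag is passed at startup. The following dry-run sketch prints the two commands rather than running them; `cts_start_cmd` is a hypothetical helper, and ha-config.hcl is the configuration file name used in the startup example earlier in this topic.

```shell
# cts_start_cmd is a hypothetical helper, not part of CTS.
# It prints (does not run) the start command for an instance.
# Pass "reset" as the second argument to include -reset-storage.
cts_start_cmd() {
  cmd="consul-terraform-sync start -config-file $1"
  if [ "${2:-}" = "reset" ]; then
    cmd="$cmd -reset-storage"
  fi
  echo "$cmd"
}

cts_start_cmd ha-config.hcl reset   # step 3: discard persisted data
cts_start_cmd ha-config.hcl         # step 6: start with the current state
```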

Troubleshooting

Use the following troubleshooting procedure if a previous leader had been running a task successfully but the new leader logs an error after a failover:

Check the logs printed to the console for errors. Refer to the syslog configuration for information on how to locate the logs. In the following example output, CTS reported a 401 Bad credentials error:

2022-08-23T09:25:09.501-0700 [ERROR] tasksmanager: error applying task: task_name=config-task
error=
| error tf-apply for 'config-task': exit status 1
|
| Error: GET https://api.github.com/user: 401 Bad credentials []
|
| with module.config-task.provider["registry.terraform.io/integrations/github"],
| on .terraform/modules/config-task/main.tf line 11, in provider "github":
| 11: provider "github" {
|

Check for differences between the previous leader and new leader, such as differences in configurations, environment variables, and local resources.
Start a new instance with the fix that resolves the issue.
Tear down the leader instance that has the issue and any other instances that may have the same issue.
Restart the affected instances to implement the fix.
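
When checking for differences between the previous leader and the new leader, comparing environment variable names is often a quick first step, because a credentials error like the one above can come from a variable that is set on one host but not the other. The following is a minimal sketch: `env_names_diff` is a hypothetical helper that takes two files containing `env` output captured on each host, and it compares names only so that secret values are not printed.

```shell
# env_names_diff is a hypothetical helper, not part of CTS.
# $1: file with `env` output from the previous leader
# $2: file with `env` output from the new leader
# Prints variable names that exist on only one of the two hosts.
env_names_diff() {
  old_names="$(mktemp)"
  new_names="$(mktemp)"
  cut -d= -f1 "$1" | sort -u > "$old_names"
  cut -d= -f1 "$2" | sort -u > "$new_names"
  # "<" marks names only on the previous leader, ">" only on the new leader
  diff "$old_names" "$new_names" | grep '^[<>]' || true
  rm -f "$old_names" "$new_names"
}
```

A line starting with `<` names a variable that only the previous leader had, which is a good candidate for the missing setting.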