Consul and chaos engineering
Modern applications are no longer tied to physical characteristics, such as geography, largely due to the proliferation of public clouds and advances in distributed computing facilitated by the Internet. Businesses can deploy applications to different cloud providers around the world to improve application resiliency, reduce latency, and prevent an outage resulting in lost revenue.
Improving application resiliency, however, is particularly difficult. Some organizations attempt to improve resiliency by deploying additional infrastructure (such as load-balancers), abstracting application responsibilities, and/or changing application code. In many cases, however, the solutions introduce new complexity to the application architecture.
One method for improving application resilience without deploying additional infrastructure or making application changes is to deploy applications to Consul service mesh.
In this tutorial, you will learn how Consul can improve application resiliency through the service mesh and automatically failover application traffic to healthy instances. You will also learn about Chaos Engineering, a practice of intentionally stressing your application architecture and validating its ability to handle failures. You will use Chaos Engineering practices to verify Consul improved the application's resiliency and availability.
Additionally, this tutorial was presented live at the HashiCorp Consul office hours.
What is Chaos Engineering?
The following definition comes from https://principlesofchaos.org:
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
In summary, Chaos Engineering is the execution of an intentional test or series of tests that validate your application's ability to handle failure scenarios in production. This can range from hardware failures to various network scenarios that could result in an application failure.
Intentionally breaking your own application, especially in production, may sound counterproductive, but conducting experiments to identify weaknesses in your application and underlying system is critical for improving resiliency.
Mirroring a production environment is costly and difficult. An organization can come close to replicating its production environment, but there will always be differences. To truly identify the weaknesses in your system and application architecture, you must run these experiments in the environment you want to learn about: production.
Each set of experiments must be accompanied by a hypothesis you are trying to disprove. The more challenging it is to prove the hypothesis wrong, the more confident you can be in your application's ability to handle error scenarios of that type. If you can disprove the hypothesis, you now have information on improvement areas. For this tutorial, you will try to disprove the hypothesis that the application cannot handle a container failure at the backend.
Chaos Engineering is a broader topic than we have room to discuss here. Refer to the following community resource to learn more about Chaos Engineering: https://github.com/dastergon/awesome-chaos-engineering.
Now that you have a background on Chaos Engineering, you are ready to start experimenting with the example application.
Prerequisites
- Docker v20.10.8 or greater
- Access to the terminal/console
You can also follow along in our hosted lab environment, a free interactive command-line lab that lets you complete this tutorial on actual cloud infrastructure.
Overview of HashiCups
In this tutorial, you will use the HashiCups example application to conduct Chaos Engineering experiments. HashiCups simulates a coffee purchasing application and is deployed redundantly into two different Consul datacenters. Assume each Consul datacenter is located in a different geographical region. The desired behavior is that if any service goes down, traffic should automatically fail over to a healthy service in the other Consul datacenter. Below is an architectural diagram of HashiCups.
Review the following diagram to help you understand HashiCups and its network traffic pattern.
Note
Each service container contains the Envoy proxy and the Consul agent.
Setting up HashiCups
To get started with the HashiCups example application locally, follow the steps listed below.
Start by cloning the git repository locally.
Change the working directory.
Check out the correct git tag.
Change directory into the HashiCups with Docker Compose example.
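The commands below sketch these four steps. The repository URL is the one HashiCorp uses for its Consul-on-Docker tutorials; the tag and example directory are placeholders, so substitute the values referenced by this tutorial.

```shell
# Clone the example repository and move into the chaos engineering example.
# The tag and directory name below are placeholders, not the tutorial's actual values.
git clone https://github.com/hashicorp/learn-consul-docker.git
cd learn-consul-docker
git checkout tags/<tutorial-tag>
cd <chaos-engineering-example-directory>
```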
Next, build the Docker images required for HashiCups with the provided build script. This will take a few minutes depending on your machine.
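The script name below is an assumption; run whichever build script is included in the example directory.

```shell
# Build the HashiCups service images locally (script name is hypothetical).
./build_images.sh
```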
Next, use Docker compose to start up the application.
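Assuming the example ships a single docker-compose.yml at the root of its directory, the following brings everything up in the background:

```shell
# Start HashiCups and both Consul datacenters in detached mode.
docker-compose up --detach
```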
Note: Wait approximately 20 seconds for all services to come up. Allow some time for services to register with Consul before verifying through the Consul UI.
Visit the Consul dashboard at http://localhost:8500 and ensure all services are up in both Consul datacenters (DC1 and DC2).
To access the Consul UI use the following ACL bootstrap token: 20d16fb2-9bd6-d238-bfdc-1fab80177667
Warning
Do not make your Consul bootstrap token publicly available in a production environment.
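If you prefer the command line over the UI, a quick check against Consul's catalog API works too. The token is the bootstrap token above; append `?dc=dc2` to the URL to query the second datacenter.

```shell
# List every registered service in DC1 through the HTTP API.
curl --header "X-Consul-Token: 20d16fb2-9bd6-d238-bfdc-1fab80177667" \
  http://localhost:8500/v1/catalog/services
```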
Lastly, verify the application is up and running by going to http://localhost:80.
Consul and application resiliency
The frontend service queries the public api service. This API request occurs while the frontend is loading in your web browser. To verify there is no automatic failover, take down the public api service.
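One way to do this with Docker is shown below; the container name is an assumption, so use the name defined for the DC1 public API service in docker-compose.yml.

```shell
# Stop the DC1 public API container (container name is hypothetical).
docker stop public-api
```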
With the public api container stopped, the HashiCups landing page returns an error. Refresh the page in your web browser to verify this.
You can also see in the application logs of the frontend service that the upstream public api service is unavailable.
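For example, assuming the DC1 frontend container is named frontend:

```shell
# Show the most recent frontend logs, which include the failed upstream calls.
docker logs --tail 20 frontend
```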
Bring the public api service back up:
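Again assuming the hypothetical container name from above:

```shell
# Restart the container that was stopped earlier.
docker start public-api
```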
If you refresh the HashiCups landing page you will now see a selection of HashiCorp flavored coffees.
Unfortunately, the error scenario you created is a common occurrence. You could conduct a chaos experiment to safely confirm the following hypothesis: the application is unable to handle a container failure.
There are ways to mitigate this error with load balancers or by creating DNS rules and health checks. However, these introduce additional complexity to the application architecture. You could immediately address the problem by adding more instances of the service in the local Consul datacenter; the service mesh would simply route requests to the healthy local instances.
Adding more service instances helps, but what happens if the entire geographical region becomes unavailable? The ideal solution should be simple to implement, contain minimal application changes, and be flexible enough to adapt to changing environments. This is where Consul can help you create the ideal solution.
A key service mesh benefit is the ability to dynamically change the application traffic, which enables you to route the traffic to healthy instances or backup services. As long as the service is registered and participating in the service mesh, it becomes eligible to receive application traffic.
Service Resolvers
In this tutorial, replicas of all services in DC1 are deployed in another Consul datacenter, DC2. As such, you can route application traffic to a healthy instance of the public api service in DC2 using Consul service resolvers.
Service resolvers allow you to define and control how upstream services are discovered and made available to the application. Service resolvers can also be used to define failover behaviors. Defining failover behavior is a powerful tool in improving application resiliency. You can use a service resolver configuration to enable dynamic failover to another Consul datacenter or even another geographical region. Review the following service resolver configuration.
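A minimal sketch of such a configuration is shown below, written to a hypothetical public-api-resolver.hcl file; the exact file name and service name in the example repository may differ.

```shell
# Create a service-resolver configuration entry for the public api service.
# File name and service name are assumptions; adjust them to match the example.
cat > public-api-resolver.hcl <<'EOF'
Kind           = "service-resolver"
Name           = "public-api"
ConnectTimeout = "0s"

Failover = {
  "*" = {
    Datacenters = ["dc2"]
  }
}
EOF
```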
The service resolver configuration above is for the public api service. The `Failover` entry is where you define the failover behavior. In this configuration, `dc2` is set as the primary failover datacenter, and the `"*"` represents all instances of the service in the specified datacenter. The `ConnectTimeout` is set to `0s` to enable rapid feedback, but you could increase it if latency concerns or other application overhead warrant a longer timeout.
Note
Visit the service resolver documentation to learn more about the configuration details for service resolvers.
Use the command below to apply the service resolver configuration to the public api service.
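A sketch of applying it with the Consul CLI, assuming the CLI is installed locally, the agent is reachable on the default address, and the hypothetical file name from the earlier sketch:

```shell
# Authenticate with the tutorial's bootstrap token, then write the config entry.
export CONSUL_HTTP_TOKEN=20d16fb2-9bd6-d238-bfdc-1fab80177667
consul config write public-api-resolver.hcl
```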
Verify that the service resolver is applied by opening the Consul UI in a browser and viewing the service overview page for the public api. From the overview page, click on the Routing tab. The routing page shows that the resolver configuration is applied, along with any other configurations that shape network behavior. If you change the datacenter view to DC2, you will see that the configuration also applies to the public api service in DC2.
Take down the public api service in DC1 to thoroughly verify the configuration.
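As in the earlier failure test, a hypothetical stop command (adjust the container name to your docker-compose.yml):

```shell
# Stop the DC1 public API container again to trigger the failover.
docker stop public-api
```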
Next, refresh the HashiCups landing page. You should receive a list of coffees available for order. Restore the public api service when you are ready to move on.
We have added and applied a service resolver configuration to the other services for your convenience. Feel free to explore the service resolver configurations of the different services.
At this point you are ready to start conducting chaos experiments.
Start the chaos experiment
As a reminder, you will attempt to disprove the hypothesis that the application cannot handle a container failure at the backend. To test this hypothesis, you will use the open-source tool Pumba. Pumba is a chaos testing tool for containers that is capable of emulating network failures and stress-testing container resources.
To help you see the failover, we have created a script that repeatedly purchases a cup of coffee. You may explore the HashiCups UI throughout the experiment, but due to browser caching behavior it is not a reliable measurement tool. The script provides a better visual representation because each API request is printed to the console. Remember, the public api service makes API calls to its upstream services, so if any of those upstream requests fail because a service is down, the overall request fails. This lets you verify that the application is still available and responding.
We recommend opening a split screen view of the Consul UI and the Terminal window while conducting the test so that you can see how Consul detects service disruptions.
Start the coffee purchase script from inside the frontend container in DC1.
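The container name and script path below are assumptions; use the names defined in the example's docker-compose.yml and scripts directory.

```shell
# Run the coffee purchase loop inside the DC1 frontend container.
docker exec -it frontend sh -c "sh /scripts/purchase-coffee.sh"
```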
Open another terminal tab and issue the following command to start the chaos experiment using Pumba. This experiment will only target the primary region (DC1). We want to verify that failover to DC2 works.
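A sketch of such an experiment using Pumba's label filtering; the label key and value are hypothetical, so match them to the labels applied to the DC1 containers in docker-compose.yml. This assumes the pumba binary is available locally (it can also be run from its official Docker image).

```shell
# Every 30 seconds, kill one random container whose labels match the filter.
pumba --random \
  --interval 30s \
  --label chaos.target=dc1 \
  kill --signal SIGKILL "re2:.*"
```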
Note
To stop the experiment, type Ctrl+C or Command+C.
As the test runs, the Consul UI shows service disruptions while Pumba takes down the respective service containers. When Consul detects that a service is down, it uses the information provided in the service resolver configuration and instructs the Envoy proxy to direct network traffic to the healthy service in DC2. The diagram below depicts the failover behavior with service A attempting to reach service B.
The healthy service accepts the requests and responds. The healthy response can be seen in the output of the coffee order script.
The chaos experiment you triggered only targeted the primary Consul datacenter. If you want to, you may trigger another experiment by targeting both Consul datacenters. To start another experiment, first make sure to stop the current running experiment. Next, issue the command below.
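One possible invocation is sketched below, assuming a hypothetical label shared by the service containers in both datacenters; adjust the filter to the labels in docker-compose.yml.

```shell
# Target containers in DC1 and DC2 by matching a shared (hypothetical) label.
pumba --random \
  --interval 30s \
  --label chaos.enabled=true \
  kill --signal SIGKILL "re2:.*"
```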
To use the coffee purchase order script in DC2, open another terminal tab and issue the command below.
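Again, the container name and script path are assumptions:

```shell
# Run the coffee purchase loop inside the DC2 frontend container.
docker exec -it frontend-dc2 sh -c "sh /scripts/purchase-coffee.sh"
```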
The second experiment can be used to test an active-active application architecture. You can visit the HashiCups UI for DC2 by going to http://localhost:3030.
Note
Active-active refers to an application architecture that supports multiple instances of the application running simultaneously. Active-active architectures are used to improve resiliency and availability, and to load balance traffic.
You may stop the experiment and start your own set of experiments by changing, adding, or removing the `--label` CLI flag in Pumba. The label flag points to the labels applied to each container in the `docker-compose.yml` file.
Feel free to make changes as needed.
Use Ctrl+C or Command+C to stop the experiment scripts when you are done with the chaos experiments.
Reviewing the results
In this experiment you can see that the application is able to process coffee orders with minimal disruptions. You might have noticed that occasionally a single request fails when a failover occurs for the public-api service.
This is an important data point and something that should be addressed. Due to the nature of the application and its payment processing capability, ideally all requests should be handled correctly. There are various ways to address this problem, but the critical takeaway is that you now have an important finding to review and discuss with your team regarding next steps.
The hypothesis that you tried to disprove, "the application cannot handle a container failure at the backend," is technically not disproven. Implementing a service resolver through Consul tremendously improved the application's overall resiliency, but there is still room for improvement, which is the most important lesson to take away from Chaos Engineering.
The HashiCups application does not have any retry logic. Implementing retry logic through Consul and deploying more than a single instance in each Consul datacenter would help address the gaps that the chaos experiment revealed. Finding opportunities for improvement is better than passing a series of test cases that exercise the application in an isolated environment.
Clean-up
Stop all experiments by using Ctrl+C or Command+C. Next, stop the Docker environment.
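Assuming the single docker-compose.yml used during setup:

```shell
# Stop and remove the HashiCups containers, networks, and volumes.
docker-compose down --volumes
```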
Next steps
This tutorial introduced chaos engineering and demonstrated how Consul can improve application resiliency through the service mesh and service resolver configurations. You conducted chaos experiments that tested a hypothesis that was not conclusively disproven, and you identified steps that could further improve the application's resilience. Chaos Engineering is all about learning and identifying areas of opportunity for improvement. By implementing Consul's service mesh and taking advantage of the dynamic network functionality Consul offers, you can significantly improve your application's availability and resiliency without increasing the complexity of the application and architecture.