Monitor Consul server health and performance with metrics and logs


Consul server metrics and logs give you detailed statistical and performance information about your Consul cluster. Metrics provide a general overview of system health and performance, while logs provide context and details used to diagnose issues and identify the root cause of problems. Once you enable these Consul observability features, Consul emits runtime metrics and operational logs of its subsystems.

In this tutorial, you will enable Consul server metrics and server logging for your Consul cluster. You will use Grafana to explore dashboards that provide information regarding health, performance, and operations for your Consul cluster. In the process, you will learn how using these features can provide you with deep insights into the operational health and performance of your Consul cluster.

Scenario overview

To begin this tutorial, you will use Terraform to deploy a self-managed Consul cluster and an observability suite on Elastic Kubernetes Service (EKS).

The architecture diagram of the scenario. This shows the Kubernetes environment and self-managed Consul cluster.

Each Consul server can emit server metrics and server logs that contain timings, protocols, and additional information for analyzing the health and performance of your Consul cluster. By configuring the Consul Helm chart, you can configure your Consul servers to emit this observability information so Prometheus and Promtail can scrape and store the data. You can then visualize the metrics and logs with Grafana.

The observability traffic flow diagram of the scenario.

In this tutorial, you will:

Deploy the following resources with Terraform:
- Elastic Kubernetes Service (EKS) cluster
- A self-managed Consul datacenter on EKS
- Grafana, Prometheus, and Loki on EKS
Perform the following Consul control plane procedures:
- Review and enable server metrics and server logging features
- Explore dashboards with Grafana

Prerequisites

The tutorial assumes that you are familiar with Consul and its core functionality. If you are new to Consul, refer to the Consul Getting Started tutorials collection.

For this tutorial, you will need:

An AWS account configured for use with Terraform
(Optional) An HCP account
aws-cli >= 2.0
terraform >= 1.0
consul >= 1.17.0
consul-k8s >= 1.2.0
helm >= 3.0
git >= 2.0
kubectl > 1.24
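
A quick way to confirm that the command-line tools above are installed is a small shell helper. This is a minimal sketch: `missing_tools` is a hypothetical helper name, not part of the tutorial repository, and it only checks that each tool is on your PATH, not its version. The usage line includes `jq` because later commands in this tutorial pipe kubectl output through it.

```shell
# Hypothetical helper: report which of the given tools are not on PATH.
# Returns 0 when every tool is found, 1 otherwise.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   missing_tools aws terraform consul consul-k8s helm git kubectl jq
```

If the helper prints nothing, all listed tools were found. You still need to compare the versions against the minimums above by hand.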

Clone GitHub repository

Clone the GitHub repository containing the configuration files and resources.

$ git clone https://github.com/hashicorp-education/learn-consul-cluster-telemetry

Change into the directory that contains the complete configuration files for this tutorial.

$ cd learn-consul-cluster-telemetry/self-managed/eks

Review repository contents

This repository contains Terraform configuration to spin up the initial infrastructure and all files to deploy Consul, the demo application, and the observability suite resources.

The eks directory contains the following Terraform configuration files:

aws-vpc.tf defines the AWS VPC resources
eks-cluster.tf defines Amazon EKS cluster deployment resources
eks-consul.tf defines the self-managed Consul deployment
eks-observability.tf defines the Prometheus, Promtail, Loki, and Grafana resources
outputs.tf defines outputs you will use to authenticate and connect to your Kubernetes cluster
providers.tf defines the AWS and Kubernetes providers for Terraform
variables.tf defines variables you can use to customize the tutorial

The directory also contains the following subdirectories:

../../dashboards contains the JSON configuration files for the example Grafana dashboards
config contains the custom Consul ACL configuration file and the Consul synthetic load generator configuration file
helm contains the Helm charts for Consul, Prometheus, Promtail, Loki, and Grafana

Deploy infrastructure and demo application

With these Terraform configuration files, you are ready to deploy your infrastructure. Initialize your Terraform configuration to download the necessary providers and modules.

$ terraform init
Initializing the backend...
Initializing provider plugins...
## ...
Terraform has been successfully initialized!
## ...

Then, deploy the resources. Confirm the run by entering yes.

$ terraform apply
## ...
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
## ...
Apply complete! Resources: 61 added, 0 changed, 0 destroyed.

The Terraform deployment could take up to 15 minutes to complete.

Connect to your infrastructure

Now that you have deployed the Kubernetes cluster, configure kubectl to interact with it.

$ aws eks --region $(terraform output -raw region) update-kubeconfig --name $(terraform output -raw kubernetes_cluster_id)

Ensure all services are up and running successfully

Check the pods across all namespaces to confirm they are running successfully.

$ kubectl get pods --all-namespaces --field-selector metadata.namespace!=kube-system
NAMESPACE       NAME                                                 READY   STATUS    RESTARTS   AGE
consul          consul-connect-injector-9b944b6c4-hq99p              1/1     Running   0          7m17s
consul          consul-server-0                                      1/1     Running   0          7m17s
consul          consul-server-1                                      1/1     Running   0          7m17s
consul          consul-server-2                                      1/1     Running   0          7m17s
consul          consul-webhook-cert-manager-9d7cc8cc5-wpx76          1/1     Running   0          7m17s
observability   grafana-5dccdcd7c8-qhhbr                             1/1     Running   0          6m3s
observability   loki-0                                               1/1     Running   0          6m27s
observability   loki-canary-dqpdt                                    1/1     Running   0          6m27s
observability   loki-canary-fgndz                                    1/1     Running   0          6m27s
observability   loki-canary-j7k7q                                    1/1     Running   0          6m27s
observability   loki-gateway-5c59784b98-k4wgk                        1/1     Running   0          6m27s
observability   loki-grafana-agent-operator-d7c684bf9-jkgkb          1/1     Running   0          6m27s
observability   loki-logs-4lfxm                                      2/2     Running   0          6m22s
observability   loki-logs-96wcb                                      2/2     Running   0          6m22s
observability   loki-logs-zvspl                                      2/2     Running   0          6m22s
observability   prometheus-kube-state-metrics-8646c88b45-q5rbz       1/1     Running   0          6m34s
observability   prometheus-prometheus-node-exporter-57rqj            1/1     Running   0          6m34s
observability   prometheus-prometheus-node-exporter-d6c6f            1/1     Running   0          6m34s
observability   prometheus-prometheus-node-exporter-tjlfs            1/1     Running   0          6m34s
observability   prometheus-prometheus-pushgateway-79ff799669-4gm44   1/1     Running   0          6m34s
observability   prometheus-server-6c87bf4dd9-s7m7x                   2/2     Running   0          6m34s
observability   promtail-ccm5q                                       1/1     Running   0          5m8s
observability   promtail-djcpn                                       1/1     Running   0          5m8s
observability   promtail-tzp6g                                       1/1     Running   0          5m8s
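
With this many pods, scanning the table by eye is error-prone. The following sketch filters the same output down to pods that are not fully ready. `unready_pods` is a hypothetical helper name, and it assumes the column layout shown above (NAMESPACE, NAME, READY, STATUS). Pods from completed Jobs, if any exist in your cluster, would also be listed because their status is not Running.

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces` output on stdin
# and print namespace/name for each pod that is not Running or not fully ready.
unready_pods() {
  awk 'NR > 1 {
    split($3, ready, "/")              # READY column, for example "1/1"
    if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2
  }'
}

# Usage (requires access to the cluster):
#   kubectl get pods --all-namespaces --field-selector metadata.namespace!=kube-system | unready_pods
```

Empty output means every listed pod is Running with all of its containers ready.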

Configure your CLI to interact with Consul datacenter

In this section, you will set environment variables in your terminal so your Consul CLI can interact with your Consul datacenter. The Consul CLI reads these environment variables for behavior defaults and will reference these values when you run consul commands.

Set the Consul server destination address.

$ export CONSUL_HTTP_ADDR=https://$(kubectl get services/consul-ui --namespace consul -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

Retrieve the ACL bootstrap token from the respective Kubernetes secret and set it as an environment variable.

$ export CONSUL_HTTP_TOKEN=$(kubectl get --namespace consul secrets/consul-bootstrap-acl-token --template={{.data.token}} | base64 -d)

Remove SSL verification checks to simplify communication to your Consul datacenter.

$ export CONSUL_HTTP_SSL_VERIFY=false

In a production environment, we recommend keeping SSL verification set to true. Only disable this verification for development environments and demonstration purposes.
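
If the load balancer hostname or the Kubernetes secret is not available yet, the export commands above still succeed but leave you with an address of just `https://` or an empty token. The sketch below catches those two cases before you run any consul commands. `check_consul_env` is a hypothetical helper name, not a Consul CLI feature.

```shell
# Hypothetical helper: verify the Consul CLI environment variables look usable.
# Returns 0 when both values are plausible, 1 otherwise.
check_consul_env() {
  status=0
  case "${CONSUL_HTTP_ADDR:-}" in
    ""|"https://"|"http://")
      echo "CONSUL_HTTP_ADDR has no hostname"
      status=1
      ;;
  esac
  if [ -z "${CONSUL_HTTP_TOKEN:-}" ]; then
    echo "CONSUL_HTTP_TOKEN is empty"
    status=1
  fi
  return "$status"
}

# Usage:
#   check_consul_env && consul catalog nodes
```

If the helper reports a problem, wait a minute for the load balancer to finish provisioning and run the export commands again.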

Verify that you can communicate with your Consul cluster by printing all known nodes and the metadata about them.

$ consul catalog nodes
Node             ID        Address     DC
consul-server-0  6965c864  10.0.6.174  dc1
consul-server-1  d461434e  10.0.4.52   dc1
consul-server-2  b73bfdf9  10.0.5.101  dc1

Enable Consul server metrics and logging

Consul server metrics and logs provide you with detailed health and performance information for your Consul clusters. In this section, you will review the parameters that enable these features and update your Consul installation to apply the new configuration.

Review the Consul values file

Consul lets you expose metrics and logs for your server pods so they may be scraped by a Prometheus service that is outside of your service mesh. Review these snippets from the helm/consul-v2-telemetry.yaml configuration file to see the parameters that enable these features.

Consul metrics are only exposed on port 8500. Setting httpsOnly: false in the tls block allows Prometheus to scrape this port for metrics.

global:
## ...
 tls:
   httpsOnly: false
## ...

The following block enables metrics for all agents in your Consul datacenter.

global:
## ...
 metrics:
   enabled: true
   enableAgentMetrics: true
## ...

This block configures your Consul servers to emit server logs.

## ...
server:
 ## ...
 extraConfig: |
   {
     "log_level": "TRACE"
   }
## ...

Refer to the Consul metrics for Kubernetes documentation and official Helm chart values to learn more about metrics configuration options and details.
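
TRACE is the most verbose Consul log level, so once it is enabled it can help to see how the log volume splits across levels. The sketch below counts log lines per level. `log_level_counts` is a hypothetical helper name, and it assumes Consul's default log line format, where the level appears in square brackets such as `[INFO]`.

```shell
# Hypothetical helper: count log lines per level, reading Consul logs on stdin.
# Assumes the level appears as [TRACE], [DEBUG], [INFO], [WARN], or [ERROR].
log_level_counts() {
  awk 'match($0, /\[(TRACE|DEBUG|INFO|WARN|ERROR)\]/) {
    level = substr($0, RSTART + 1, RLENGTH - 2)   # strip the brackets
    counts[level]++
  }
  END { for (l in counts) print l, counts[l] }' | sort
}

# Usage (requires access to the cluster):
#   kubectl logs consul-server-0 --namespace consul | log_level_counts
```

Each output line holds a level name followed by its line count, sorted alphabetically by level.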

Update the Consul deployment

Update Consul in your Kubernetes cluster with Consul K8S CLI to let Prometheus collect metrics from your Consul servers. Confirm the run by entering y.

$ consul-k8s upgrade -config-file=helm/consul-v2-telemetry.yaml

Refer to the Consul K8S CLI documentation to learn more about additional settings.

Alternatively, update Consul in your Kubernetes cluster with Helm instead of the Consul K8S CLI to let Prometheus collect metrics from your Consul servers. Use one method or the other, not both.

$ helm upgrade --values helm/consul-v2-telemetry.yaml consul hashicorp/consul --namespace consul --version "1.2.1"

The Consul update could take up to 5 minutes to complete.

Review the official Helm chart values to learn more about these settings.
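
After the update completes, you can spot-check that the servers expose metrics before opening Grafana. The sketch below filters Prometheus-format metrics text by name prefix. `metric_lines` is a hypothetical helper name. The usage comment assumes the `CONSUL_HTTP_ADDR` and `CONSUL_HTTP_TOKEN` variables set earlier and uses Consul's `/v1/agent/metrics` HTTP endpoint with `format=prometheus`. The `-k` flag skips certificate verification, matching the `CONSUL_HTTP_SSL_VERIFY=false` setting used in this tutorial.

```shell
# Hypothetical helper: print metric samples whose name starts with the given
# prefix, reading Prometheus text-format metrics on stdin. Comment lines such
# as "# HELP" and "# TYPE" never match because they start with "#".
metric_lines() {
  prefix="$1"
  awk -v p="$prefix" 'index($0, p) == 1'
}

# Usage (requires access to the Consul datacenter):
#   curl -sk -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
#     "$CONSUL_HTTP_ADDR/v1/agent/metrics?format=prometheus" | metric_lines consul_raft_
```

Empty output suggests that metrics are not enabled yet or that the upgrade is still rolling out.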

Configure the anonymous ACL policy

In addition to configuring Consul, you need to modify the anonymous ACL policy to allow agent:read permissions so Prometheus can scrape metrics from the secure Consul servers. Other permissions in the included file will allow the Consul load generator service to communicate with the respective Consul features.

$ consul acl policy update -name "anonymous-token-policy" \
                       -datacenter "dc1" \
                       -rules @config/acl-policy.hcl

Review the Consul ACL Policies documentation to learn more.

Note

In a production environment, we recommend using the Prometheus Consul Exporter for the most secure, restrictive access to Consul metrics on port 8501.

Deploy the Consul load generator

Deploy the Consul load generator to create synthetic loads for KV, service registration, and the ACL engine. This will create more realistic visualizations in your Grafana dashboards.

$ kubectl apply -f config/consul-load-generator.yaml

Explore Consul health and performance dashboards

Consul control plane metrics and logs provide you with detailed health and performance information for your Consul servers. In this section, you will use Grafana to examine how this information provides insights into your Consul control plane.

Explore Consul telemetry dashboard

Navigate to the control plane monitoring dashboard.

$ export GRAFANA_CP_DASHBOARD=http://$(kubectl get svc/grafana --namespace observability -o json | jq -r '.status.loadBalancer.ingress[0].hostname')/d/control-plane-performance-monitoring && echo $GRAFANA_CP_DASHBOARD
http://a20fb6f2d1d3e4be296d05452a378ad2-428040929.us-west-2.elb.amazonaws.com/d/control-plane-performance-monitoring

The example dashboards take a few minutes to populate with data after the telemetry metrics feature is enabled.
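
The command above depends on `jq`. If `jq` is not installed, kubectl's jsonpath output, used earlier to set `CONSUL_HTTP_ADDR`, returns the same hostname. The sketch below builds the dashboard URL from that hostname and fails loudly when the hostname is empty. `dashboard_url` is a hypothetical helper name.

```shell
# Hypothetical helper: build a Grafana dashboard URL from a load balancer
# hostname and a dashboard path. Fails if the hostname is empty.
dashboard_url() {
  host="$1"
  path="$2"
  if [ -z "$host" ]; then
    echo "load balancer hostname is empty; the service may still be provisioning" >&2
    return 1
  fi
  printf 'http://%s/d/%s\n' "$host" "$path"
}

# Usage (requires access to the cluster):
#   host=$(kubectl get svc/grafana --namespace observability \
#     -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
#   dashboard_url "$host" control-plane-performance-monitoring
```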

This dashboard provides several sections that give you a variety of information for your Consul control plane. These graphs can be useful to analyze the health of your Consul server pods to identify any anomalies in behavior.

The system stats tab.

Notice that the System Stats tab includes CPU usage and memory usage metrics. Sustained high values in these areas can cause long loading times, slow performance, and unexpected crashes.

Now, click on the Consul Server Behavior tab. This tab gives insight into the health of Consul's Raft protocol, with higher than average numbers indicating slowdowns in reaching consensus between Consul servers.

The Consul Server Behavior tab.

Click on the Feature: Catalog tab.

The Catalog Feature tab.

This tab provides health information about the registration/deregistration of nodes, services, and checks in Consul. This can provide useful insight into the load pressure on each of your Consul servers.

Tip

Consul telemetry metrics contain a large set of statistics that you can use to create custom dashboards for monitoring your Consul clusters according to your production environment's unique requirements. Refer to the Consul telemetry overview for a complete list and description of available metrics.

Explore Consul server logs dashboard

Navigate to the control plane logs dashboard.

$ export GRAFANA_CP_LOGS_DASHBOARD=http://$(kubectl get svc/grafana --namespace observability -o json | jq -r '.status.loadBalancer.ingress[0].hostname')/d/control-plane-logs/ && echo $GRAFANA_CP_LOGS_DASHBOARD
http://a20fb6f2d1d3e4be296d05452a378ad2-428040929.us-west-2.elb.amazonaws.com/d/control-plane-logs/

The Grafana dashboard may take a few moments to fully load in your browser.

Notice that the example dashboard panes provide detailed event and error insights for your Consul control plane.

The initial state of the dashboard.

For example, the RPC Server Call Request Type Distribution pie chart gives you the read/write ratio of RPC server calls in your Consul cluster during a specific time window.

Type request_type=write in the search field to look deeper into the server logs.

State of the dashboard after something is typed in the search filter.

Notice how this action applies a filter to the respective visualizations and raw logs containing that value so you can zoom into error logs for further analysis and troubleshooting. Click on one of the raw logs to view the entire access log contents.

Zoomed in on raw logs.

Notice that you can explore the other fields associated with your search terms to learn more information about a particular error or event.
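
You can approximate the read/write distribution from the command line as well. The sketch below counts `request_type` values in raw server logs. `request_type_counts` is a hypothetical helper name, and it assumes the log lines carry `request_type=<value>` pairs, as the search filter above suggests.

```shell
# Hypothetical helper: count occurrences of each request_type value,
# reading raw Consul server logs on stdin.
request_type_counts() {
  awk '{
    line = $0
    while (match(line, /request_type=[A-Za-z_]+/)) {
      # "request_type=" is 13 characters long; keep only the value after it.
      value = substr(line, RSTART + 13, RLENGTH - 13)
      counts[value]++
      line = substr(line, RSTART + RLENGTH)
    }
  }
  END { for (v in counts) print v, counts[v] }' | sort
}

# Usage (requires access to the cluster):
#   kubectl logs consul-server-0 --namespace consul | request_type_counts
```

Each output line holds a request type followed by the number of times it appeared in the logs.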

Clean up resources

Destroy the Terraform resources to clean up your environment. Confirm the destroy operation by inputting yes.

$ terraform destroy

## ...
Do you really want to destroy all resources?
 Terraform will destroy all your managed infrastructure, as shown above.
 There is no undo. Only 'yes' will be accepted to confirm.

Enter a value: yes

## ...

Destroy complete! Resources: 0 added, 0 changed, 61 destroyed.

Note

Due to race conditions with the cloud resources in this tutorial, you may need to run the destroy operation twice to remove all the resources.
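
If you script the cleanup, a small retry wrapper covers the case where the first destroy fails. This is a minimal sketch: `retry` is a hypothetical helper, not a Terraform feature. The usage line passes Terraform's `-auto-approve` flag, which skips the interactive confirmation, so only use it when you are sure you want to delete everything.

```shell
# Hypothetical helper: run a command up to N times, stopping at the first success.
# Returns 0 on success, 1 if every attempt fails.
retry() {
  attempts="$1"
  shift
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $n of $attempts failed: $*" >&2
    n=$((n + 1))
  done
  return 1
}

# Usage:
#   retry 2 terraform destroy -auto-approve
```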

Next steps

In this tutorial, you enabled Consul server metrics and logs to enhance the health and performance monitoring of your Consul cluster. This integration offers increased control plane understanding, reduced operational overhead, and faster incident resolution.

For more information about the topics covered in this tutorial, refer to the following resources:

Debug services with proxy access logs

Audit logging