Troubleshoot Vault
Troubleshooting Vault involves learning from available sources of observability and monitoring, like server or client error messages, audit devices, and telemetry metrics, to isolate the root cause of an issue.
Vault provides operators with a rich collection of data, which can help the HashiCups team find the root cause when troubleshooting an issue.
Server and client output: Oliver and Steve can use the Vault server's operational output or messages from the CLI or API to help troubleshoot use case and server issues.
Audit device output: The team can also enable audit devices to record the details of every request and response between applications and Vault. This data is also handy for troubleshooting issues with specific use cases and client applications.
Telemetry metrics: Since Steve's job is to ensure that Vault performs its best, they can enable and use telemetry metrics which measure Vault server performance in one of several popular formats. They can then use the team's aggregation solution to roll up Vault metrics and share them through dashboards and enable alerting on key metrics.
Scenario
Oliver, Steve, and Danielle are all involved in troubleshooting Vault in some form or another. Danielle uses Vault API warnings and errors to understand client issues when building applications or plugins, and sometimes needs to use audit device entries to troubleshoot tricky policy problems.
Oliver and Steve rely on Vault's server output, telemetry metrics and audit device entries to troubleshoot a range of issues with Vault use cases and issues with the servers.
Server output
Vault outputs server operational data to the operating system standard output and standard error devices, and the Linux systemd journal automatically gathers this output by default. This means the team can consume operational output from Vault in the same way they do with other systemd services.
Steve can troubleshoot Vault issues with the server output because it provides issue context, including timestamps and warning or error messages from the server. This information gives Steve more insight into the server's state during the time-frame of the incident under investigation.
Server output consists of single lines, which follow a consistent format shown and described in the following example.
Each log line has the format: timestamp [log level] subsystem: message
In the example line, the events subsystem logs a message at the INFO log level, and the message text of the log line is "Starting event system".
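The line format described above lends itself to simple parsing when filtering server output. The following is a minimal Python sketch, not an official parser. The regular expression, the `LogEntry` type, and the `parse_log_line` helper are assumptions made for illustration, and the timestamp in the sample line is invented.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern for: timestamp [LEVEL] subsystem: message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"\[(?P<level>[A-Z]+)\]\s+"
    r"(?P<subsystem>[^:]+):\s+"
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    subsystem: str
    message: str


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Split one server output line into its four parts, or return None."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return LogEntry(**match.groupdict())


# Hypothetical sample line; only the subsystem, level, and message
# come from the tutorial text. The timestamp is made up.
sample = "2024-01-01T12:00:00.000Z [INFO]  events: Starting event system"
entry = parse_log_line(sample)

if __name__ == "__main__":
    print(entry)
```

Lines that do not follow the format, such as startup banners, return None, so a caller can skip them when scanning for warnings or errors.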
Tip
Oliver can configure a production server to emit logs in different levels of detail from lowest to highest by specifying a log level. The available log levels are error, warn, info (default), debug, and trace, with the highest detail levels being most useful for troubleshooting. Similarly, Danielle can pass a -log-level flag to a development mode server while testing something they are building.
CLI and API output
Vault's CLI outputs warnings and errors to the system standard error. These messages begin with 'Warning' or 'Error', and if the terminal supports color output, warning messages appear in yellow and error messages appear in red.
Here is an error message example from the Vault CLI:
This message indicates that the CLI attempted a connection to the Vault server at https://127.0.0.1:8200, but the host refused the connection. Such an error is typical when the Vault service is not listening on the address and port of the request.
Vault's HTTP API returns JSON data containing error objects when the server encounters a problem handling the request.
Here is an example error object as part of the server's response.
This error is the result of the client attempting access on a path for which no handler is available.
The root cause might be a typo in the request path on the client side. In this example, the correct K/V secrets engine path begins with operations-secrets. The request is missing the pluralizing s in the path, so it fails with this error.
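A client application can surface these error objects programmatically instead of reading raw JSON. The sketch below assumes the response body carries an `errors` array of strings. The helper names, the sample path, and the exact message wording are illustrative assumptions rather than output captured from a real server.

```python
import json
from typing import List


def extract_errors(body: str) -> List[str]:
    """Return the error strings from an API error response body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    errors = payload.get("errors")
    if not isinstance(errors, list):
        return []
    return [str(item) for item in errors]


def looks_like_missing_route(errors: List[str]) -> bool:
    """Heuristic: no handler exists for the requested path."""
    return any("no handler for route" in item for item in errors)


# Hypothetical response body with an assumed typo in the mount path.
sample_body = json.dumps(
    {
        "errors": [
            'no handler for route "operation-secrets/data/example". '
            "route entry not found."
        ]
    }
)

if __name__ == "__main__":
    found = extract_errors(sample_body)
    if looks_like_missing_route(found):
        print("Check the request path for typos:", found[0])
```

Matching on message text is fragile, so treat the heuristic as a hint to double-check the request path rather than a definitive diagnosis.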
Better together
Oliver can troubleshoot this error with Danielle, and cross-reference entries from an audit device to find Danielle's request and the corresponding Vault response for more troubleshooting context. Synthesizing data from many sources in this way often helps with issue root cause isolation and resolution.
UI messages
Vault's web UI emits warnings and error messages, which are often useful to include when reporting issues for troubleshooting purposes.
Here are some examples of warnings and errors in the Vault UI.
Audit device output
Vault features audit devices, which record all client requests and server responses in detail and write the data to a configurable destination. Oliver can enable more than one audit device type, for example one that writes to a file and another that writes to a network socket. Oliver can also configure a syslog audit device for writing to the local syslog agent on Unix systems.
Vault formats audit device entries as JSON objects representing request and response pairs. Each object holds non-sensitive and sensitive values. Vault hashes sensitive values with a salt and the HMAC-SHA256 algorithm so that sensitive content is not present as plaintext in the audit device output.
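To make the hashing idea concrete, here is a minimal Python sketch of a salted HMAC-SHA256 computation. It illustrates the technique only: the salt value is hypothetical, the `audit_style_hmac` helper is an assumption, and the `hmac-sha256:` prefix reflects how hashed values commonly appear in audit entries. Vault manages its salt internally, so this sketch will not reproduce the hashes of a real audit device.

```python
import hashlib
import hmac


def audit_style_hmac(salt: bytes, value: str) -> str:
    """Hash a sensitive value with a salt using HMAC-SHA256."""
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "hmac-sha256:" + digest


# Hypothetical salt and secret, for illustration only.
example_salt = b"example-salt"
hashed = audit_style_hmac(example_salt, "my-secret-value")

if __name__ == "__main__":
    print(hashed)
```

Because the same salt and input always produce the same digest, an operator who knows a candidate value can compare its hash with a logged hash without the plaintext ever appearing in the audit output.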
The audit device type determines where you can find the output. If it is a file audit device, you can typically find the output in a file on a local filesystem. Socket and syslog based audit devices typically forward output to a remote host for ingestion and processing. In such cases, you can find the audit device output in the tool that processes the entries and makes them available to filter, search, and display in dashboards.
The following example builds from the HTTP API error example in the earlier section, where a client requested a path containing a typo. Compare the request and the matching response to learn more about the structure and metadata contents.
The object type is request, and it includes the timestamp of the request along with several related fields about the host, path, and token.
Telemetry metrics
Steve from the SRE team and Oliver in Operations sometimes work together on troubleshooting Vault performance issues. Vault telemetry metrics offer them key insights into cluster or server performance.
Vault emits telemetry metrics configurable for push or pull based consumption. HashiCups uses both solution types for ingesting Vault telemetry metrics depending on the project or use case.
The configuration and format of the metrics depend on the consumer, and output can take the form of tabular or JSON based data. Oliver enabled telemetry for pull metrics from Prometheus on the team's internal testing cluster. This means the team members with correct capabilities can query the /sys/metrics API endpoint for telemetry data.
A visual dashboard system like Grafana is the typical place for Oliver and Steve to interact with Vault telemetry metrics. The team can also manually query metrics when the situation requires immediate access outside the context of a dashboard.
Here are 2 raw telemetry data examples taken directly from the Vault CLI and HTTP API.
(Persona: Operations)
Oliver has a token with the capability to read from the /sys/metrics endpoint, so they use the CLI to read the endpoint:
The output is in tabular format containing native Go map structures which are not easy to read. You can search the output for certain metric names to zero in on their values as part of troubleshooting, or use another output format.
The following is an abbreviated output example:
The metrics named vault.audit.log_request_failure, vault.autopilot.failure_tolerance, and vault.audit.file/.log_request are present along with their values in this example.
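When the tabular output is hard to scan, one alternative is to work with metrics in the Prometheus text exposition format and filter by name. The following Python sketch assumes the metrics text was already retrieved in that format. The metric names use an assumed underscore form of the names mentioned above, and all values and labels are hypothetical.

```python
from typing import Dict


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Map each series (name plus any labels) to its numeric value."""
    metrics: Dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field. Optional trailing
        # timestamps are not handled in this sketch.
        series, _, value = line.rpartition(" ")
        try:
            metrics[series] = float(value)
        except ValueError:
            continue
    return metrics


def find_metrics(metrics: Dict[str, float], needle: str) -> Dict[str, float]:
    """Return only the series whose name contains the search text."""
    return {name: value for name, value in metrics.items() if needle in name}


# Hypothetical sample data; values are invented for illustration.
sample_text = """\
# TYPE vault_audit_log_request_failure counter
vault_audit_log_request_failure 0
vault_autopilot_failure_tolerance 1
vault_core_unsealed{cluster="example"} 1
"""

if __name__ == "__main__":
    print(find_metrics(parse_prometheus_text(sample_text), "audit"))
```

A dashboard remains the better everyday tool; a filter like this is mainly useful for a quick manual check during an incident.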
Troubleshoot a server issue
(Persona: Operations)
Oliver is starting a self-managed Vault server for a user acceptance testing cluster that they are preparing. They installed Vault on Linux with the official community edition package.
The configuration file Oliver created for the cluster servers looks like this example:
When Oliver tries to start the 'vault' service, it fails, and systemctl returns the following error message:
This error message includes some actions Oliver can take to dive deeper into the error condition and find its cause.
For example, if Oliver checks the systemd journal, they'll learn more about the cause of this error.
Example abbreviated output:
Notice the helpful error explanation text, "Cluster address must be set when using raft storage". This indicates that there is a missing server configuration option, and Oliver must add the option to the existing configuration to resolve the issue.
Oliver can research the Vault server configuration file documentation for the Integrated Storage (Raft) storage backend to learn more about the cluster address requirement. They can then update the configuration with the necessary option.
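A quick pre-flight check can catch this class of mistake before the service starts. The Python sketch below is a rough text scan, not an HCL parser, and it ignores any settings supplied through environment variables. The `needs_cluster_addr` helper and the sample configuration, including its paths, node ID, and host names, are hypothetical and are not Oliver's actual file.

```python
import re


def needs_cluster_addr(config_text: str) -> bool:
    """Return True when raft storage is configured without cluster_addr."""
    uses_raft = re.search(
        r'^\s*storage\s+"raft"', config_text, re.MULTILINE
    ) is not None
    has_cluster_addr = re.search(
        r"^\s*cluster_addr\s*=", config_text, re.MULTILINE
    ) is not None
    return uses_raft and not has_cluster_addr


# Hypothetical configuration that omits the cluster address.
broken_config = """\
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "node-1"
}

listener "tcp" {
  address = "0.0.0.0:8200"
}

api_addr = "https://vault.example.internal:8200"
"""

# The same configuration with the assumed missing option added.
fixed_config = broken_config + 'cluster_addr = "https://vault.example.internal:8201"\n'

if __name__ == "__main__":
    print("broken needs cluster_addr:", needs_cluster_addr(broken_config))
    print("fixed needs cluster_addr:", needs_cluster_addr(fixed_config))
```

The journal message remains the authoritative signal; a scan like this only shortens the feedback loop while editing the file.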
Here is an example of the working configuration file after Oliver's update:
Troubleshoot a client issue
The Vault client CLI emits helpful warnings and errors when issues arise. Vault users can find the issue root cause and fix the problem with these messages. The following are some examples of CLI errors with causes and resolutions.
Client and server protocol mismatch
(Persona: Operations)
A common issue where the client's message is useful for troubleshooting is a protocol mismatch between the client and the server.
Users like Oliver, who make regular use of the CLI, can encounter this type of issue while operating a development mode server, for example.
Suppose Oliver starts Vault in development mode without specifying any flags, like this:
The development mode server starts with operational logging emitted to the standard output; an abbreviated example of that output follows.
The Vault server is ready. If Oliver attempts to access the server in another terminal session without exporting the proper VAULT_ADDR environment variable or passing the -address flag, commands can fail.
For example, Oliver checks the server status:
The command fails with the following useful error message:
The client warns that both VAULT_ADDR and the flag -address have unset values, and mentions the value it will use instead, "https://127.0.0.1:8200". The client also returns an error message that includes "http: server gave HTTP response to HTTPS client".
The Vault CLI expects to use an HTTPS connection to the server by default.
Since Oliver started the development mode server without using the flag to enable built-in TLS, the server started with an insecure HTTP listener. The CLI needs HTTPS, but the server does not have a TLS enabled listener in this case, and the CLI exits with the error.
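The mismatch can be modeled in a few lines. This Python sketch mimics how a client might pick its server address and then checks whether the chosen scheme matches the listener. The precedence order (flag first, then environment variable, then default) and both helper names are assumptions for illustration, not the CLI's actual implementation.

```python
from typing import Mapping, Optional
from urllib.parse import urlparse

# Default address the client reports when nothing else is set.
DEFAULT_ADDR = "https://127.0.0.1:8200"


def resolve_address(flag_value: Optional[str], env: Mapping[str, str]) -> str:
    """Pick the server address: flag, then VAULT_ADDR, then the default."""
    if flag_value:
        return flag_value
    return env.get("VAULT_ADDR") or DEFAULT_ADDR


def scheme_mismatch(client_addr: str, server_uses_tls: bool) -> bool:
    """Return True when the client scheme does not match the listener."""
    client_uses_tls = urlparse(client_addr).scheme == "https"
    return client_uses_tls != server_uses_tls


if __name__ == "__main__":
    # Nothing set: the client falls back to HTTPS against an HTTP listener.
    addr = resolve_address(None, {})
    print(addr, "mismatch:", scheme_mismatch(addr, server_uses_tls=False))

    # VAULT_ADDR exported with an HTTP URL: the schemes now agree.
    addr = resolve_address(None, {"VAULT_ADDR": "http://127.0.0.1:8200"})
    print(addr, "mismatch:", scheme_mismatch(addr, server_uses_tls=False))
```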
The solution to this issue is actually contained as a tip in the server output:
Oliver can export the VAULT_ADDR environment variable, and specify that the CLI use an HTTP URL for the server address.
Now the CLI commands work as expected.
Example output:
When running Vault in dev mode, how do you know what address to set for the VAULT_ADDR environment variable?
When the Vault startup process completes, it will print the VAULT_ADDR address to use.
Troubleshoot a use case issue
(Persona: Developer)
Danielle is leading the development team on NewCup, a brand new application prototype that enables customers to design their own cups using powerful AI models.
They should have access to manage a new collection of related secrets in the developer Vault cluster through the K/V secrets engine enabled at the path project-newcup-secrets. Their access should include the ability to list, create, read, and update all secrets at this path.
Danielle knows that one of the secrets they can manage is newcup-aggregator, an aggregator API key that another member of the developer team has written to the secrets engine.
They use the Vault API to read this secret:
Vault responds with the secret contents, including the aggregator API key:
Danielle checks to learn if the other project secrets are available by listing the contents of /project-newcup-secrets/metadata/:
Example output:
Danielle did not expect a permission denied error, and they should have the ability to list the secrets, so they start troubleshooting the issue.
Given that this issue involves permission to the secret, Danielle can begin troubleshooting by examining the ACL policies associated with their token through a token self-lookup.
Example output:
The results of Danielle's token lookup hold a list of policies associated with their token, and include the project-newcup-developers policy in that list.
Danielle does not have the capability to read the policy to know if it is correct.
Example output:
Danielle reaches out to the operations team through a support request to confirm that the project-newcup-developers policy includes the capability to list secrets.
(Persona: Operations)
Oliver from operations can access the Vault audit device log content from the SIEM solution and find the corresponding request that Danielle made. Here is the raw JSON from that request as it appears in the audit device log.
In addition to several handy data points, the request information also holds the permission denied error, along with the path of the request operation and the type of operation requested.
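Finding the relevant entry by hand is tedious in a busy log, so a small filter helps. The Python sketch below scans JSON lines for permission denied errors under a path prefix. It assumes each entry carries `time`, `type`, and `error` fields plus a `request` object with `operation` and `path`. The helper name and the sample entries, including their timestamps, are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def find_denied_requests(lines: Iterable[str], path_prefix: str) -> List[Dict]:
    """Return a summary of denied entries whose path starts with the prefix."""
    matches: List[Dict] = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip anything that is not a JSON object line.
        if not isinstance(entry, dict):
            continue
        request = entry.get("request") or {}
        error = entry.get("error") or ""
        path = str(request.get("path", ""))
        if "permission denied" in error and path.startswith(path_prefix):
            matches.append(
                {
                    "time": entry.get("time"),
                    "type": entry.get("type"),
                    "operation": request.get("operation"),
                    "path": path,
                }
            )
    return matches


# Hypothetical audit entries for illustration.
sample_log = [
    json.dumps({
        "time": "2024-01-01T12:00:00Z",
        "type": "request",
        "request": {"operation": "read",
                    "path": "project-newcup-secrets/data/newcup-aggregator"},
    }),
    json.dumps({
        "time": "2024-01-01T12:01:00Z",
        "type": "request",
        "request": {"operation": "list",
                    "path": "project-newcup-secrets/metadata/"},
        "error": "permission denied",
    }),
]

if __name__ == "__main__":
    for hit in find_denied_requests(sample_log, "project-newcup-secrets/"):
        print(hit)
```

In practice the SIEM solution usually offers an equivalent query; the sketch shows which fields matter when building one.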
Oliver agrees that this is possibly due to a policy issue, and reads the project-newcup-developers policy to confirm:
Example output:
The capabilities do not include "list". This is the root cause of the permission denied error that has Danielle and the NewCup project team blocked.
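The comparison Oliver just made by eye can be expressed as a set difference. In this Python sketch, the required capabilities come from the access Danielle should have, while the granted list is a hypothetical reconstruction of the policy (only the absence of list is known from the text). The `missing_capabilities` helper is an assumption for illustration.

```python
from typing import Iterable, Set


def missing_capabilities(granted: Iterable[str], required: Iterable[str]) -> Set[str]:
    """Return the required capabilities that the policy path does not grant."""
    granted_set = set(granted)
    # An explicit deny overrides every other capability on the path.
    if "deny" in granted_set:
        return set(required)
    return set(required) - granted_set


# Capabilities Danielle should have on the secrets path.
required = ["list", "create", "read", "update"]

# Hypothetical capabilities in the policy before Oliver's fix.
granted_before = ["create", "read", "update"]

if __name__ == "__main__":
    print("missing:", missing_capabilities(granted_before, required))
```

An empty result means the policy path grants everything the use case needs, which is a handy check to repeat after updating the policy.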
Oliver updates the policy to add the list capability. Vault evaluates policy contents each time a token is used, so the change applies to tokens that already have the policy attached; to rule out a stale token, they still ask Danielle to authenticate to Vault for a new token.
(Persona: Developer)
Danielle authenticates to Vault, and gets a new token with the updated project-newcup-developers policy attached. They are able to list all secrets in the project-newcup-secrets secrets engine now:
Example output:
Summary
Vault provides logs and metrics to help you identify and resolve issues in your Vault deployment. You can ship both logs and metrics to observability and SIEM tools to consume this critical information.