Telemetry
The Vault server process collects various runtime metrics about the performance of different libraries and subsystems. These metrics are aggregated on a ten-second interval and are retained in memory for one minute. To monitor Vault and collect durable metrics, telemetry from Vault must be stored in metrics aggregation software.
To view the raw data, you must send a signal to the Vault process: on Unix-style operating systems, this is USR1, while on Windows it is BREAK. When the Vault process receives this signal, it dumps the current telemetry information to the process's stderr.
This telemetry information can be used for debugging or otherwise getting a better view of what Vault is doing.
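As a minimal sketch of the signal-based dump on a Unix-style system, the Python snippet below sends USR1 to a process by PID. The function name `request_telemetry_dump` is hypothetical, and finding the Vault PID is left to the caller. The Windows BREAK case is deliberately not covered here.

```python
import os
import signal


def request_telemetry_dump(pid: int) -> None:
    """Send USR1 to the given process (Unix-style systems only).

    If the target is a Vault server, it should respond by writing its
    current in-memory telemetry to its own stderr, not to this script.
    """
    os.kill(pid, signal.SIGUSR1)


# Example usage (the PID is a placeholder; look up the real one yourself):
#   request_telemetry_dump(12345)
```

The dump appears wherever the Vault process's stderr is directed, so check that stream or its log capture rather than the output of the script.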
Telemetry information can also be streamed directly from Vault to a range of metrics aggregation solutions, as described in the telemetry stanza documentation.
The following is an example telemetry dump snippet:
You'll note that log entries are prefixed with the metric type as follows:
- [C] is a counter. Counters are cumulative metrics that are incremented when some event occurs, and are reset at the end of reporting intervals. Vault retains counters and other metrics for one minute in-memory, so to see accurate and persistent counters over time an aggregation solution must be configured.
- [G] is a gauge. Gauges provide measurements of current values.
- [S] is a summary. Summaries provide sample observations of values. Vault commonly uses summaries for measuring timing duration of discrete events in the reporting interval.
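The prefix convention above can be turned into a small lookup. The sketch below assumes only that each dump line contains one of the bracketed prefixes `[C]`, `[G]`, or `[S]`. The sample line in the comment is hypothetical and is not taken from a real Vault dump.

```python
import re

# Mapping from the dump prefix letter to the metric type it denotes.
METRIC_TYPES = {"C": "counter", "G": "gauge", "S": "summary"}

# Matches the first bracketed single-letter prefix: [C], [G], or [S].
_PREFIX = re.compile(r"\[([CGS])\]")


def metric_type(line: str):
    """Return 'counter', 'gauge', or 'summary' for a dump line, else None."""
    match = _PREFIX.search(line)
    return METRIC_TYPES[match.group(1)] if match else None


# Hypothetical line, for illustration only:
#   metric_type("[G] 'vault.runtime.num_goroutines': 12.000")
```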
The following sections describe available Vault metrics. The metrics interval can be assumed to be 10 seconds when manually triggering metrics output using the signals described above. Some high-cardinality gauges, like vault.secret.kv.count, are emitted every 10 minutes, or at an interval configured in the telemetry stanza.
Some Vault metrics come with additional labels describing the measurement in more detail, such as the namespace in which an operation takes place, or the auth method used to create a token. In the in-memory telemetry, or other telemetry engines that do not support labels, this additional information is incorporated into the metric name. The metric name in the table below is followed by a list of labels supported, in the order in which they appear if flattened.
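To illustrate what "flattened" means here, the sketch below appends label values to the metric name in their listed order. The dot separator, the use of values only, and the sample label values are all assumptions made for illustration. The exact flattened form depends on the telemetry sink and is not specified in this document.

```python
def flatten_metric_name(name, labels):
    """Append label values to a metric name, preserving label order.

    `labels` is an ordered list of (label_name, value) pairs.
    ASSUMPTION: values are joined with '.'; a real sink may differ.
    """
    return ".".join([name] + [value for _, value in labels])


# Hypothetical label values for a metric listed with (cluster, namespace):
flat = flatten_metric_name(
    "vault.token.count",
    [("cluster", "c1"), ("namespace", "ns1")],
)
print(flat)
```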
Audit Metrics
These metrics relate to auditing.
Metric | Description | Unit | Type |
---|---|---|---|
vault.audit.log_request | Duration of time taken by all audit log requests across all audit log devices | ms | summary |
vault.audit.log_response | Duration of time taken by audit log responses across all audit log devices | ms | summary |
vault.audit.log_request_failure | Number of audit log request failures. NOTE: This is a particularly important metric. Any non-zero value here indicates that there was a failure to make an audit log request to any of the configured audit log devices; when Vault cannot log to any of the configured audit log devices it ceases all user operations, and you should begin troubleshooting the audit log devices immediately if this metric continually increases. | failures | counter |
vault.audit.log_response_failure | Number of audit log response failures. NOTE: This is a particularly important metric. Any non-zero value here indicates that there was a failure to receive a response to a request made to one of the configured audit log devices; when Vault cannot log to any of the configured audit log devices it ceases all user operations, and you should begin troubleshooting the audit log devices immediately if this metric continually increases. | failures | counter |
NOTE: In addition, there are audit metrics for each enabled audit device, represented as vault.audit.<type>.log_request. For example, if a file audit device is enabled, its metrics would be vault.audit.file.log_request and vault.audit.file.log_response.
Core Metrics
These metrics represent operational aspects of the running Vault instance.
Metric | Description | Unit | Type |
---|---|---|---|
vault.barrier.delete | Duration of time taken by DELETE operations at the barrier | ms | summary |
vault.barrier.get | Duration of time taken by GET operations at the barrier | ms | summary |
vault.barrier.put | Duration of time taken by PUT operations at the barrier | ms | summary |
vault.barrier.list | Duration of time taken by LIST operations at the barrier | ms | summary |
vault.cache.hit | Number of times a value was retrieved from the LRU cache. | cache hit | counter |
vault.cache.miss | Number of times a value was not in the LRU cache. This results in a read from the configured storage. | cache miss | counter |
vault.cache.write | Number of times a value was written to the LRU cache. | cache write | counter |
vault.cache.delete | Number of times a value was deleted from the LRU cache. This does not count cache expirations. | cache delete | counter |
vault.core.active | Has value 1 when the vault node is active, and 0 when node is in standby. | bool | gauge |
vault.core.activity.fragment_size | Number of entities or tokens (depending on the "type" label) observed by the local node. | tokens | counter |
vault.core.activity.segment_write | Duration of time taken writing activity log segments to storage. | ms | summary |
vault.core.check_token | Duration of time taken by token checks handled by Vault core | ms | summary |
vault.core.fetch_acl_and_token | Duration of time taken by ACL and corresponding token entry fetches handled by Vault core | ms | summary |
vault.core.handle_request | Duration of time taken by non-login requests handled by Vault core | ms | summary |
vault.core.handle_login_request | Duration of time taken by login requests handled by Vault core | ms | summary |
vault.core.in_flight_requests | Number of in-flight requests. | requests | gauge |
vault.core.leadership_setup_failed | Duration of time taken by cluster leadership setup failures which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status. | ms | summary |
vault.core.leadership_lost | Duration of time taken by cluster leadership losses which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status. | ms | summary |
vault.core.license.expiration_time_epoch | Time as epoch (seconds since Jan 1 1970) at which license will expire. | seconds | gauge |
vault.core.mount_table.num_entries | Number of mounts in a particular mount table. This metric is labeled by table type (auth or logical) and whether or not the table is replicated (local or not) | objects | gauge |
vault.core.mount_table.size | Size of a particular mount table. This metric is labeled by table type (auth or logical) and whether or not the table is replicated (local or not) | bytes | gauge |
vault.core.post_unseal | Duration of time taken by post-unseal operations handled by Vault core | ms | summary |
vault.core.pre_seal | Duration of time taken by pre-seal operations | ms | summary |
vault.core.seal-with-request | Duration of time taken by requested seal operations | ms | summary |
vault.core.seal | Duration of time taken by seal operations | ms | summary |
vault.core.seal-internal | Duration of time taken by internal seal operations | ms | summary |
vault.core.step_down | Duration of time taken by cluster leadership step downs. This should be monitored and alerted on for overall cluster leadership status. | ms | summary |
vault.core.unseal | Duration of time taken by unseal operations | ms | summary |
vault.core.unsealed | Has value 1 when Vault is unsealed, and 0 when Vault is sealed. | bool | gauge |
vault.metrics.collection (cluster,gauge) | Time taken to collect usage gauges, labeled by gauge type. | | summary |
vault.metrics.collection.interval (cluster,gauge) | Current value of usage gauge collection interval. | | summary |
vault.metrics.collection.error (cluster,gauge) | Errors while collecting usage gauges, labeled by gauge type. | | counter |
vault.rollback.attempt.<mountpoint> | Time taken to perform a rollback operation on the given mount point. The mount point name has its forward slashes (/) replaced by hyphens (-). For example, vault.rollback.attempt.auth-token- is the metric reported for a rollback operation on the auth/token backend. | ms | summary |
vault.route.create.<mountpoint> | Time taken to dispatch a create operation to a backend, and for that backend to process it. The mount point name has its forward slashes (/) replaced by hyphens (-). For example, vault.route.create.ns1-secret- is the metric for a create operation to ns1/secret/. The number of samples of this metric, and of the corresponding metrics for the other operations below, indicates how many operations were performed per mount point. | ms | summary |
vault.route.delete.<mountpoint> | Time taken to dispatch a delete operation to a backend, and for that backend to process it. | ms | summary |
vault.route.list.<mountpoint> | Time taken to dispatch a list operation to a backend, and for that backend to process it. | ms | summary |
vault.route.read.<mountpoint> | Time taken to dispatch a read operation to a backend, and for that backend to process it. | ms | summary |
vault.route.rollback.<mountpoint> | Time taken to dispatch a rollback operation to a backend, and for that backend to process it. Rollback operations are automatically scheduled to clean up partial errors. | ms | summary |
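The mount point naming rule used by the vault.rollback.attempt and vault.route metrics above can be sketched in a few lines. The helper name `mount_metric_name` is hypothetical. It only reproduces the slash-to-hyphen substitution described in the table, checked against the two examples given there.

```python
def mount_metric_name(prefix: str, mount_point: str) -> str:
    """Build a per-mount metric name by replacing '/' with '-'."""
    return f"{prefix}.{mount_point.replace('/', '-')}"


# The two examples from the table above:
print(mount_metric_name("vault.rollback.attempt", "auth/token/"))
print(mount_metric_name("vault.route.create", "ns1/secret/"))
```

Note that the trailing slash of the mount point becomes a trailing hyphen in the metric name.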
Runtime Metrics
These metrics collect information from Vault's Go runtime, such as memory usage information.
Metric | Description | Unit | Type |
---|---|---|---|
vault.runtime.alloc_bytes | Number of bytes allocated by the Vault process. This could burst from time to time, but should return to a steady state value. | bytes | gauge |
vault.runtime.free_count | Number of freed objects | objects | gauge |
vault.runtime.heap_objects | Number of objects on the heap. This is a good general memory pressure indicator worth establishing a baseline and thresholds for alerting. | objects | gauge |
vault.runtime.malloc_count | Cumulative count of allocated heap objects | objects | gauge |
vault.runtime.num_goroutines | Number of goroutines. This serves as a general system load indicator worth establishing a baseline and thresholds for alerting. | goroutines | gauge |
vault.runtime.sys_bytes | Number of bytes allocated to Vault. This includes what is being used by Vault's heap and what has been reclaimed but not given back to the operating system. | bytes | gauge |
vault.runtime.total_gc_pause_ns | The total garbage collector pause time since Vault was last started | ns | gauge |
vault.runtime.gc_pause_ns | Total duration of the last garbage collection run | ns | summary |
vault.runtime.total_gc_runs | Total number of garbage collection runs since Vault was last started | operations | gauge |
Policy Metrics
These metrics report measurements of the time spent performing policy operations.
Metric | Description | Unit | Type |
---|---|---|---|
vault.policy.get_policy | Time taken to get a policy | ms | summary |
vault.policy.list_policies | Time taken to list policies | ms | summary |
vault.policy.delete_policy | Time taken to delete a policy | ms | summary |
vault.policy.set_policy | Time taken to set a policy | ms | summary |
Token, Identity, and Lease Metrics
These metrics cover measurement of token, identity, and lease operations, and counts of the number of such objects managed by Vault.
Metric | Description | Unit | Type |
---|---|---|---|
vault.expire.fetch-lease-times | Time taken to fetch lease times | ms | summary |
vault.expire.fetch-lease-times-by-token | Time taken to fetch lease times by token | ms | summary |
vault.expire.num_leases | Number of all leases which are eligible for eventual expiry | leases | gauge |
vault.expire.num_irrevocable_leases | Number of leases that cannot be revoked automatically | leases | gauge |
vault.expire.leases.by_expiration (cluster,gauge,expiring,namespace) | Number of leases set to expire, grouped by a time interval. This time interval and total number of time intervals are configurable via lease_metrics_epsilon and num_lease_metrics_buckets in the telemetry stanza of a vault server configuration. The default values for these are 1hr and 168 respectively, so the metric will report the number of leases that will expire each hour from the current time to a week from the current time. One can additionally group lease expiration by namespace by setting add_lease_metrics_namespace_labels to true in the config file (default is false ). | leases | gauge |
vault.expire.job_manager.total_jobs | Total pending revocation jobs | leases | summary |
vault.expire.job_manager.queue_length | Total pending revocation jobs by auth method | leases | summary |
vault.expire.lease_expiration | Count of lease expirations | leases | counter |
vault.expire.lease_expiration.time_in_queue | Time taken for lease to get to the front of the revoke queue | ms | summary |
vault.expire.lease_expiration.error | Count of lease expiration errors | errors | counter |
vault.expire.revoke | Time taken to revoke a token | ms | summary |
vault.expire.revoke-force | Time taken to forcibly revoke a token | ms | summary |
vault.expire.revoke-prefix | Time taken to revoke tokens on a prefix | ms | summary |
vault.expire.revoke-by-token | Time taken to revoke all secrets issued with a given token | ms | summary |
vault.expire.renew | Time taken to renew a lease | ms | summary |
vault.expire.renew-token | Time taken to renew a token which does not need to invoke a logical backend | ms | summary |
vault.expire.register | Time taken for register operations | ms | summary |
vault.expire.register-auth | Time taken for register authentication operations which create lease entries without lease ID | ms | summary |
vault.identity.num_entities | Number of identity entities stored in Vault | entities | gauge |
vault.identity.entity.active.monthly (cluster, namespace) | Number of distinct entities that created a token during the past month, per namespace. Only available if client count is enabled. Reported at the start of each month. | entities | gauge |
vault.identity.entity.active.partial_month (cluster) | Total number of distinct entities that created a token during the current month. Only available if client count is enabled. Reported periodically within each month. | entities | gauge |
vault.identity.entity.active.reporting_period (cluster, namespace) | Number of distinct entities that created a token in the past N months, as defined by the client count default reporting period. Only available if client count is enabled. Reported at the start of each month. | entities | gauge |
vault.identity.entity.alias.count (cluster, namespace, auth_method, mount_point) | Number of identity entity aliases stored in Vault, grouped by the auth mount that created them. This gauge is computed every 10 minutes. | aliases | gauge |
vault.identity.entity.count (cluster, namespace) | Number of identity entities stored in Vault, grouped by namespace. | entities | gauge |
vault.identity.entity.creation (cluster, namespace, auth_method, mount_point) | Number of identity entities created, grouped by the auth mount that created them. | entities | counter |
vault.identity.upsert_entity_txn | Time taken to insert a new or modified entity into the in-memory database, and persist it to storage. | ms | summary |
vault.identity.upsert_group_txn | Time taken to insert a new or modified group into the in-memory database, and persist it to storage. This operation is performed on group membership changes. | ms | summary |
vault.token.count (cluster, namespace) | Number of service tokens available for use; counts all un-expired and un-revoked tokens in Vault's token store. This measurement is performed every 10 minutes. | tokens | gauge |
vault.token.count.by_auth (cluster, namespace, auth_method) | Number of service tokens that were created by a particular auth method. | tokens | gauge |
vault.token.count.by_policy (cluster, namespace, policy) | Number of service tokens that have a particular policy attached. If a token has more than one policy, it is counted in each policy gauge. | tokens | gauge |
vault.token.count.by_ttl (cluster, namespace, creation_ttl) | Number of service tokens, grouped by the TTL range they were assigned at creation. | tokens | gauge |
vault.token.create | The time taken to create a token | ms | summary |
vault.token.create_root | Number of created root tokens. Does not decrease on revocation. | tokens | counter |
vault.token.createAccessor | The time taken to create a token accessor | ms | summary |
vault.token.creation (cluster, namespace, auth_method, mount_point, creation_ttl, token_type) | Number of service or batch tokens created. | tokens | counter |
vault.token.lookup | The time taken to look up a token | ms | summary |
vault.token.revoke | Time taken to revoke a token | ms | summary |
vault.token.revoke-tree | Time taken to revoke a token tree | ms | summary |
vault.token.store | Time taken to store an updated token entry without writing to the secondary index | ms | summary |
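The default lease expiration bucketing for vault.expire.leases.by_expiration can be checked with simple arithmetic: 168 buckets of one hour each span one week. The sketch below reuses the setting names lease_metrics_epsilon and num_lease_metrics_buckets from the table. The `bucket_index` helper is a hypothetical illustration of how a lease might fall into a bucket, not a description of Vault's internal logic.

```python
from datetime import timedelta

# Defaults stated in the table above.
lease_metrics_epsilon = timedelta(hours=1)
num_lease_metrics_buckets = 168


def lease_metrics_horizon(epsilon: timedelta, buckets: int) -> timedelta:
    """Total time window covered by all buckets."""
    return epsilon * buckets


def bucket_index(time_until_expiry: timedelta, epsilon: timedelta, buckets: int):
    """ASSUMPTION: bucket i covers [i * epsilon, (i + 1) * epsilon).

    Returns None for leases already expired or beyond the horizon.
    """
    index = time_until_expiry // epsilon
    return index if 0 <= index < buckets else None


horizon = lease_metrics_horizon(lease_metrics_epsilon, num_lease_metrics_buckets)
print(horizon.days)  # 7
```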
Resource Quota Metrics
These metrics relate to rate limit and lease count quotas. Each metric comes with a label "name" identifying the specific quota.
Metric | Description | Unit | Type |
---|---|---|---|
vault.quota.rate_limit.violation | Total number of rate limit quota violations | quota | counter |
vault.quota.lease_count.violation | Total number of lease count quota violations | quota | counter |
vault.quota.lease_count.max | Total maximum amount of leases allowed by the lease count quota | lease | gauge |
vault.quota.lease_count.counter | Total current amount of leases generated by the lease count quota | lease | gauge |
Merkle Tree and Write Ahead Log Metrics
These metrics relate to internal operations on Merkle Trees and Write Ahead Logs (WAL).
Metric | Description | Unit | Type |
---|---|---|---|
vault.merkle.flushDirty | Time taken to flush any dirty pages to cold storage | ms | summary |
vault.merkle.flushDirty.num_pages | Number of pages flushed | pages | gauge |
vault.merkle.saveCheckpoint | Time taken to save the checkpoint | ms | summary |
vault.merkle.saveCheckpoint.num_dirty | Number of dirty pages at checkpoint | pages | gauge |
vault.wal.deleteWALs | Time taken to delete a Write Ahead Log (WAL) | ms | summary |
vault.wal.gc.deleted | Number of Write Ahead Logs (WAL) deleted during each garbage collection run | WAL | gauge |
vault.wal.gc.total | Total Number of Write Ahead Logs (WAL) on disk | WAL | gauge |
vault.wal.loadWAL | Time taken to load a Write Ahead Log (WAL) | ms | summary |
vault.wal.persistWALs | Time taken to persist a Write Ahead Log (WAL) | ms | summary |
vault.wal.flushReady | Time taken to flush a ready Write Ahead Log (WAL) to storage | ms | summary |
vault.wal.flushReady.queue_len | Size of the write queue in the WAL system | WAL | summary |
HA Metrics
These metrics are emitted on standbys when talking to the active node, and in some cases by performance standbys as well.
Metric | Description | Unit | Type |
---|---|---|---|
vault.ha.rpc.client.forward | Time taken to forward a request from a standby to the active node | ms | summary |
vault.ha.rpc.client.forward.errors | Number of standby request forwarding failures | errors | counter |
Replication Metrics
These metrics relate to Vault Enterprise Replication. The following metrics are not available in telemetry unless replication is in an unhealthy state: replication.fetchRemoteKeys, replication.merkleDiff, and replication.merkleSync.
Metric | Description | Unit | Type |
---|---|---|---|
vault.core.replication.performance.primary | Set to 1 if this is a performance primary, 0 if not | boolean | gauge |
vault.core.replication.performance.secondary | Set to 1 if this is a performance secondary, 0 if not | boolean | gauge |
vault.core.replication.dr.primary | Set to 1 if this is a DR primary, 0 if not | boolean | gauge |
vault.core.replication.dr.secondary | Set to 1 if this is a DR secondary, 0 if not | boolean | gauge |
vault.core.performance_standby | Set to 1 if this is a performance standby, 0 if not | boolean | gauge |
vault.logshipper.streamWALs.missing_guard | Number of instances where the starting Merkle Tree index used to begin streaming WAL entries is not matched/found | missing guards | counter |
vault.logshipper.streamWALs.guard_found | Number of instances where the starting Merkle Tree index used to begin streaming WAL entries is matched/found | found guards | counter |
vault.logshipper.streamWALs.scanned_entries | Number of entries scanned in the buffer before the right one was found. | scanned entries | summary |
vault.logshipper.buffer.length | Current length of the log shipper buffer | buffer entries | gauge |
vault.logshipper.buffer.size | Current size in bytes of the log shipper buffer | bytes | gauge |
vault.logshipper.buffer.max_length | Maximum length of the log shipper buffer | buffer entries | gauge |
vault.logshipper.buffer.max_size | Maximum size in bytes of the log shipper buffer | bytes | gauge |
vault.replication.fetchRemoteKeys | Time taken to fetch keys from a remote cluster participating in replication prior to Merkle Tree based delta generation | ms | summary |
vault.replication.merkleDiff | Time taken to perform a Merkle Tree based delta generation between the clusters participating in replication | ms | summary |
vault.replication.merkleSync | Time taken to perform a Merkle Tree based synchronization using the last delta generated between the clusters participating in replication | ms | summary |
vault.replication.merkle.commit_index | The last committed index in the Merkle Tree. | sequence number | gauge |
vault.replication.wal.last_wal | The index of the last WAL | sequence number | gauge |
vault.replication.wal.last_dr_wal | The index of the last DR WAL | sequence number | gauge |
vault.replication.wal.last_performance_wal | The index of the last Performance WAL | sequence number | gauge |
vault.replication.fsm.last_remote_wal | The index of the last remote WAL | sequence number | gauge |
vault.replication.wal.gc | Time taken to complete one run of the WAL garbage collection process | ms | summary |
vault.replication.rpc.server.auth_request | Duration of time taken by auth request | ms | summary |
vault.replication.rpc.server.bootstrap_request | Duration of time taken by bootstrap request | ms | summary |
vault.replication.rpc.server.conflicting_pages_request | Duration of time taken by conflicting pages request | ms | summary |
vault.replication.rpc.server.echo | Duration of time taken by echo | ms | summary |
vault.replication.rpc.server.save_mfa_response_auth | Duration of time taken by saving MFA auth response | ms | summary |
vault.replication.rpc.server.forwarding_request | Duration of time taken by forwarding request | ms | summary |
vault.replication.rpc.server.guard_hash_request | Duration of time taken by guard hash request | ms | summary |
vault.replication.rpc.server.persist_alias_request | Duration of time taken by persist alias request | ms | summary |
vault.replication.rpc.server.persist_persona_request | Duration of time taken by persist persona request | ms | summary |
vault.replication.rpc.server.stream_wals_request | Duration of time taken by stream wals request | ms | summary |
vault.replication.rpc.server.sub_page_hashes_request | Duration of time taken by sub page hashes request | ms | summary |
vault.replication.rpc.server.sync_counter_request | Duration of time taken by sync counter request | ms | summary |
vault.replication.rpc.server.upsert_group_request | Duration of time taken by upsert group request | ms | summary |
vault.replication.rpc.client.conflicting_pages | Duration of time taken by client conflicting pages request | ms | summary |
vault.replication.rpc.client.fetch_keys | Duration of time taken by client fetch keys request | ms | summary |
vault.replication.rpc.client.forward | Duration of time taken by client forward request | ms | summary |
vault.replication.rpc.client.guard_hash | Duration of time taken by client guard hash request | ms | summary |
vault.replication.rpc.client.persist_alias | Duration of time taken by client persist alias request | ms | summary |
vault.replication.rpc.client.register_auth | Duration of time taken by client register auth request | ms | summary |
vault.replication.rpc.client.register_lease | Duration of time taken by client register lease request | ms | summary |
vault.replication.rpc.client.stream_wals | Duration of time taken by client stream wals request | ms | summary |
vault.replication.rpc.client.sub_page_hashes | Duration of time taken by client sub page hashes request | ms | summary |
vault.replication.rpc.client.sync_counter | Duration of time taken by client sync counter request | ms | summary |
vault.replication.rpc.client.upsert_group | Duration of time taken by client upsert group request | ms | summary |
vault.replication.rpc.client.wrap_in_cubbyhole | Duration of time taken by client wrap in cubbyhole request | ms | summary |
vault.replication.rpc.client.save_mfa_response_auth | Duration of time taken by client saving MFA auth response | ms | summary |
vault.replication.rpc.dr.server.echo | Duration of time taken by DR echo request | ms | summary |
vault.replication.rpc.dr.server.fetch_keys_request | Duration of time taken by DR fetch keys request | ms | summary |
vault.replication.rpc.standby.server.echo | Duration of time taken by standby echo request | ms | summary |
vault.replication.rpc.standby.server.register_auth_request | Duration of time taken by standby register auth request | ms | summary |
vault.replication.rpc.standby.server.register_lease_request | Duration of time taken by standby register lease request | ms | summary |
vault.replication.rpc.standby.server.wrap_token_request | Duration of time taken by standby wrap token request | ms | summary |
Secrets Engines Metrics
These metrics relate to the supported secrets engines.
Metric | Description | Unit | Type |
---|---|---|---|
database.Initialize | Time taken to initialize a database secret engine across all database secrets engines | ms | summary |
database.<name>.Initialize | Time taken to initialize a database secret engine for the named database secrets engine <name> , for example: database.postgresql-prod.Initialize | ms | summary |
database.Initialize.error | Number of database secrets engine initialization operation errors across all database secrets engines | errors | counter |
database.<name>.Initialize.error | Number of database secrets engine initialization operation errors for the named database secrets engine <name> , for example: database.postgresql-prod.Initialize.error | errors | counter |
database.Close | Time taken to close a database secret engine across all database secrets engines | ms | summary |
database.<name>.Close | Time taken to close a database secret engine for the named database secrets engine <name> , for example: database.postgresql-prod.Close | ms | summary |
database.Close.error | Number of database secrets engine close operation errors across all database secrets engines | errors | counter |
database.<name>.Close.error | Number of database secrets engine close operation errors for the named database secrets engine <name> , for example: database.postgresql-prod.Close.error | errors | counter |
database.CreateUser | Time taken to create a user across all database secrets engines | ms | summary |
database.<name>.CreateUser | Time taken to create a user for the named database secrets engine <name> | ms | summary |
database.CreateUser.error | Number of user creation operation errors across all database secrets engines | errors | counter |
database.<name>.CreateUser.error | Number of user creation operation errors for the named database secrets engine <name> , for example: database.postgresql-prod.CreateUser.error | errors | counter |
database.RenewUser | Time taken to renew a user across all database secrets engines | ms | summary |
database.<name>.RenewUser | Time taken to renew a user for the named database secrets engine <name> , for example: database.postgresql-prod.RenewUser | ms | summary |
database.RenewUser.error | Number of user renewal operation errors across all database secrets engines | errors | counter |
database.<name>.RenewUser.error | Number of user renewal operation errors for the named database secrets engine <name> , for example: database.postgresql-prod.RenewUser.error | errors | counter |
database.RevokeUser | Time taken to revoke a user across all database secrets engines | ms | summary |
database.<name>.RevokeUser | Time taken to revoke a user for the named database secrets engine <name> , for example: database.postgresql-prod.RevokeUser | ms | summary |
database.RevokeUser.error | Number of user revocation operation errors across all database secrets engines | errors | counter |
database.<name>.RevokeUser.error | Number of user revocation operation errors for the named database secrets engine <name> , for example: database.postgresql-prod.RevokeUser.error | errors | counter |
secrets.pki.tidy.cert_store_current_entry | The index of the current entry in the certificate store being verified by the tidy operation | entry index | gauge |
secrets.pki.tidy.cert_store_deleted_count | Number of entries deleted from the certificate store | entry | counter |
secrets.pki.tidy.cert_store_total_entries | Number of entries in the certificate store to verify during the tidy operation | entry | gauge |
secrets.pki.tidy.duration | Duration of time taken by the PKI tidy operation | ms | summary |
secrets.pki.tidy.failure | Number of times the PKI tidy operation has not completed due to errors | operations | counter |
secrets.pki.tidy.revoked_cert_current_entry | The index of the current revoked certificate entry in the certificate store being verified by the tidy operation | entry index | gauge |
secrets.pki.tidy.revoked_cert_deleted_count | Number of entries deleted from the certificate store for revoked certificates | entry | counter |
secrets.pki.tidy.revoked_cert_total_entries | Number of entries in the certificate store for revoked certificates to verify during the tidy operation | entry | gauge |
secrets.pki.tidy.start_time_epoch | Start time (as seconds since Jan 1 1970) when the PKI tidy operation is active, 0 otherwise | seconds | gauge |
secrets.pki.tidy.success | Number of times the PKI tidy operation has completed successfully | operations | counter |
vault.secret.kv.count (cluster, namespace, mount_point) | Number of entries in each key-value secret engine. | paths | gauge |
vault.secret.lease.creation (cluster, namespace, secret_engine, mount_point, creation_ttl) | Counts the number of leases created by secret engines. | leases | counter |
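The database metrics in the table above follow a regular pattern: an optional engine name, an operation, and an optional error suffix. The hypothetical helper below reconstructs those names. It is a convenience sketch for building dashboard or alert queries, and it is checked only against the examples shown in the table.

```python
def database_metric(operation: str, name: str = None, error: bool = False) -> str:
    """Build a database secrets engine metric name.

    operation: e.g. "Initialize", "Close", "CreateUser", "RenewUser", "RevokeUser".
    name: the named database secrets engine, or None for the aggregate metric.
    error: True for the corresponding error counter.
    """
    parts = ["database"]
    if name:
        parts.append(name)
    parts.append(operation)
    if error:
        parts.append("error")
    return ".".join(parts)


print(database_metric("Initialize", "postgresql-prod"))
print(database_metric("RenewUser", "postgresql-prod", error=True))
```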
Storage Backend Metrics
These metrics relate to the supported storage backends.
Metric | Description | Unit | Type |
---|---|---|---|
vault.azure.put | Duration of a PUT operation against the Azure storage backend | ms | summary |
vault.azure.get | Duration of a GET operation against the Azure storage backend | ms | summary |
vault.azure.delete | Duration of a DELETE operation against the Azure storage backend | ms | summary |
vault.azure.list | Duration of a LIST operation against the Azure storage backend | ms | summary |
vault.cassandra.put | Duration of a PUT operation against the Cassandra storage backend | ms | summary |
vault.cassandra.get | Duration of a GET operation against the Cassandra storage backend | ms | summary |
vault.cassandra.delete | Duration of a DELETE operation against the Cassandra storage backend | ms | summary |
vault.cassandra.list | Duration of a LIST operation against the Cassandra storage backend | ms | summary |
vault.cockroachdb.put | Duration of a PUT operation against the CockroachDB storage backend | ms | summary |
vault.cockroachdb.get | Duration of a GET operation against the CockroachDB storage backend | ms | summary |
vault.cockroachdb.delete | Duration of a DELETE operation against the CockroachDB storage backend | ms | summary |
vault.cockroachdb.list | Duration of a LIST operation against the CockroachDB storage backend | ms | summary |
vault.consul.put | Duration of a PUT operation against the Consul storage backend | ms | summary |
vault.consul.transaction | Duration of a Txn operation against the Consul storage backend | ms | summary |
vault.consul.get | Duration of a GET operation against the Consul storage backend | ms | summary |
vault.consul.delete | Duration of a DELETE operation against the Consul storage backend | ms | summary |
vault.consul.list | Duration of a LIST operation against the Consul storage backend | ms | summary |
vault.couchdb.put | Duration of a PUT operation against the CouchDB storage backend | ms | summary |
vault.couchdb.get | Duration of a GET operation against the CouchDB storage backend | ms | summary |
vault.couchdb.delete | Duration of a DELETE operation against the CouchDB storage backend | ms | summary |
vault.couchdb.list | Duration of a LIST operation against the CouchDB storage backend | ms | summary |
vault.dynamodb.put | Duration of a PUT operation against the DynamoDB storage backend | ms | summary |
vault.dynamodb.get | Duration of a GET operation against the DynamoDB storage backend | ms | summary |
vault.dynamodb.delete | Duration of a DELETE operation against the DynamoDB storage backend | ms | summary |
vault.dynamodb.list | Duration of a LIST operation against the DynamoDB storage backend | ms | summary |
vault.etcd.put | Duration of a PUT operation against the etcd storage backend | ms | summary |
vault.etcd.get | Duration of a GET operation against the etcd storage backend | ms | summary |
vault.etcd.delete | Duration of a DELETE operation against the etcd storage backend | ms | summary |
vault.etcd.list | Duration of a LIST operation against the etcd storage backend | ms | summary |
vault.gcs.put | Duration of a PUT operation against the Google Cloud Storage storage backend | ms | summary |
vault.gcs.get | Duration of a GET operation against the Google Cloud Storage storage backend | ms | summary |
vault.gcs.delete | Duration of a DELETE operation against the Google Cloud Storage storage backend | ms | summary |
vault.gcs.list | Duration of a LIST operation against the Google Cloud Storage storage backend | ms | summary |
vault.gcs.lock.unlock | Duration of an UNLOCK operation against the Google Cloud Storage storage backend in HA mode | ms | summary |
vault.gcs.lock.lock | Duration of a LOCK operation against the Google Cloud Storage storage backend in HA mode | ms | summary |
vault.gcs.lock.value | Duration of a VALUE operation against the Google Cloud Storage storage backend in HA mode | ms | summary |
vault.mssql.put | Duration of a PUT operation against the MS-SQL storage backend | ms | summary |
vault.mssql.get | Duration of a GET operation against the MS-SQL storage backend | ms | summary |
vault.mssql.delete | Duration of a DELETE operation against the MS-SQL storage backend | ms | summary |
vault.mssql.list | Duration of a LIST operation against the MS-SQL storage backend | ms | summary |
vault.mysql.put | Duration of a PUT operation against the MySQL storage backend | ms | summary |
vault.mysql.get | Duration of a GET operation against the MySQL storage backend | ms | summary |
vault.mysql.delete | Duration of a DELETE operation against the MySQL storage backend | ms | summary |
vault.mysql.list | Duration of a LIST operation against the MySQL storage backend | ms | summary |
vault.postgres.put | Duration of a PUT operation against the PostgreSQL storage backend | ms | summary |
vault.postgres.get | Duration of a GET operation against the PostgreSQL storage backend | ms | summary |
vault.postgres.delete | Duration of a DELETE operation against the PostgreSQL storage backend | ms | summary |
vault.postgres.list | Duration of a LIST operation against the PostgreSQL storage backend | ms | summary |
vault.s3.put | Duration of a PUT operation against the Amazon S3 storage backend | ms | summary |
vault.s3.get | Duration of a GET operation against the Amazon S3 storage backend | ms | summary |
vault.s3.delete | Duration of a DELETE operation against the Amazon S3 storage backend | ms | summary |
vault.s3.list | Duration of a LIST operation against the Amazon S3 storage backend | ms | summary |
vault.spanner.put | Duration of a PUT operation against the Google Cloud Spanner storage backend | ms | summary |
vault.spanner.get | Duration of a GET operation against the Google Cloud Spanner storage backend | ms | summary |
vault.spanner.delete | Duration of a DELETE operation against the Google Cloud Spanner storage backend | ms | summary |
vault.spanner.list | Duration of a LIST operation against the Google Cloud Spanner storage backend | ms | summary |
vault.spanner.lock.unlock | Duration of an UNLOCK operation against the Google Cloud Spanner storage backend in HA mode | ms | summary |
vault.spanner.lock.lock | Duration of a LOCK operation against the Google Cloud Spanner storage backend in HA mode | ms | summary |
vault.spanner.lock.value | Duration of a VALUE operation against the Google Cloud Spanner storage backend in HA mode | ms | summary |
vault.swift.put | Duration of a PUT operation against the Swift storage backend | ms | summary |
vault.swift.get | Duration of a GET operation against the Swift storage backend | ms | summary |
vault.swift.delete | Duration of a DELETE operation against the Swift storage backend | ms | summary |
vault.swift.list | Duration of a LIST operation against the Swift storage backend | ms | summary |
vault.zookeeper.put | Duration of a PUT operation against the ZooKeeper storage backend | ms | summary |
vault.zookeeper.get | Duration of a GET operation against the ZooKeeper storage backend | ms | summary |
vault.zookeeper.delete | Duration of a DELETE operation against the ZooKeeper storage backend | ms | summary |
vault.zookeeper.list | Duration of a LIST operation against the ZooKeeper storage backend | ms | summary |
Integrated Storage (Raft)
These metrics relate to Raft-based integrated storage.
Metric | Description | Unit | Type |
---|---|---|---|
vault.raft.apply | Number of Raft transactions occurring over the interval, which is a general indicator of the write load on the Raft servers. | raft transactions / interval | counter |
vault.raft.barrier | Number of times the node has started the barrier, i.e., the number of times it has issued a blocking call to ensure that all pending queued operations have been applied to the node's FSM. | blocks / interval | counter |
vault.raft.candidate.electSelf | Time to request a vote from a peer. | ms | summary |
vault.raft.commitNumLogs | Number of logs processed for application to the FSM in a single batch. | logs | gauge |
vault.raft.commitTime | Time to commit a new entry to the Raft log on the leader. | ms | timer |
vault.raft.compactLogs | Time to trim the logs that are no longer needed. | ms | summary |
vault.raft.delete | Time to delete file from raft's underlying storage. | ms | summary |
vault.raft.delete_prefix | Time to delete files under a prefix from raft's underlying storage. | ms | summary |
vault.raft.fsm.apply | Number of logs committed since the last interval. | commit logs / interval | summary |
vault.raft.fsm.applyBatch | Time to apply batch of logs. | ms | summary |
vault.raft.fsm.applyBatchNum | Number of logs applied in batch. | logs | summary |
vault.raft.fsm.enqueue | Time to enqueue a batch of logs for the FSM to apply. | ms | timer |
vault.raft.fsm.restore | Time taken by the FSM to restore its state from a snapshot. | ms | summary |
vault.raft.fsm.snapshot | Time taken by the FSM to record the current state for the snapshot. | ms | summary |
vault.raft.fsm.store_config | Time to store the configuration. | ms | summary |
vault.raft.get | Time to retrieve file from raft's underlying storage. | ms | summary |
vault.raft.leader.dispatchLog | Time for the leader to write log entries to disk. | ms | timer |
vault.raft.leader.dispatchNumLogs | Number of logs committed to disk in a batch. | logs | gauge |
vault.raft.list | Time to retrieve list of keys from raft's underlying storage. | ms | summary |
vault.raft.peers | Number of peers in the raft cluster configuration. | peers | gauge |
vault.raft.put | Time to persist key in raft's underlying storage. | ms | summary |
vault.raft.replication.appendEntries.log | Number of logs replicated to a node, to bring it up to speed with the leader's logs. | logs appended / interval | counter |
vault.raft.replication.appendEntries.rpc | Time taken by the append entries RPC to replicate the log entries of a leader node onto its follower node(s). | ms | timer |
vault.raft.replication.heartbeat | Time taken to invoke appendEntries on a peer, so that it does not time out on a periodic basis. | ms | timer |
vault.raft.replication.installSnapshot | Time taken to process the installSnapshot RPC call. This metric should only be seen on nodes which are currently in the follower state. | ms | timer |
vault.raft.restore | Number of times the restore operation has been performed by the node. Here, restore refers to the action of raft consuming an external snapshot to restore its state. | operation invoked / interval | counter |
vault.raft.restoreUserSnapshot | Time taken by the node to restore the FSM state from a user's snapshot. | ms | timer |
vault.raft.rpc.appendEntries | Time taken to process an append entries RPC call from a node. | ms | timer |
vault.raft.rpc.appendEntries.processLogs | Time taken to process the outstanding log entries of a node. | ms | timer |
vault.raft.rpc.appendEntries.storeLogs | Time taken to add any outstanding logs for a node, since the last appendEntries was invoked. | ms | timer |
vault.raft.rpc.installSnapshot | Time taken to process the installSnapshot RPC call. This metric should only be seen on nodes which are currently in the follower state. | ms | timer |
vault.raft.rpc.processHeartbeat | Time taken to process a heartbeat request. | ms | timer |
vault.raft.rpc.requestVote | Time taken to complete requestVote RPC call. | ms | summary |
vault.raft.snapshot.create | Time taken to initialize the snapshot process. | ms | timer |
vault.raft.snapshot.persist | Time taken to dump the current snapshot taken by the node to the disk. | ms | timer |
vault.raft.snapshot.takeSnapshot | Total time involved in taking the current snapshot (creating one and persisting it) by the node. | ms | timer |
vault.raft.state.follower | Number of times the node has entered the follower state. This happens when a new node joins the cluster or after the end of a leader election. | follower state entered / interval | counter |
vault.raft.transition.heartbeat_timeout | Number of times the node has transitioned to the candidate state after receiving no heartbeat messages from the last known leader. | timeouts / interval | counter |
vault.raft.transition.leader_lease_timeout | Number of times a quorum of nodes could not be contacted. | contact failures | counter |
vault.raft.verify_leader | Number of times the node checks whether it is still the leader. | checks / interval | counter |
vault.raft-storage.delete | Time to insert log entry to delete path. | ms | timer |
vault.raft-storage.get | Time to retrieve value for path from FSM. | ms | timer |
vault.raft-storage.put | Time to insert log entry to persist path. | ms | timer |
vault.raft-storage.list | Time to list all entries under the prefix from the FSM. | ms | timer |
vault.raft-storage.transaction | Time to insert operations into a single log. | ms | timer |
vault.raft-storage.entry_size | The total size of a Raft entry during log application in bytes. | bytes | summary |
vault.raft_storage.bolt.freelist.free_pages | Number of free pages in the freelist. | pages | gauge |
vault.raft_storage.bolt.freelist.pending_pages | Number of pending pages in the freelist. | pages | gauge |
vault.raft_storage.bolt.freelist.allocated_bytes | Total bytes allocated in free pages. | bytes | gauge |
vault.raft_storage.bolt.freelist.used_bytes | Total bytes used by the freelist. | bytes | gauge |
vault.raft_storage.bolt.transaction.started_read_transactions | Number of started read transactions. | transactions | gauge |
vault.raft_storage.bolt.transaction.currently_open_read_transactions | Number of currently open read transactions. | transactions | gauge |
vault.raft_storage.bolt.page.count | Number of page allocations. | allocations | gauge |
vault.raft_storage.bolt.page.bytes_allocated | Total bytes allocated. | bytes | gauge |
vault.raft_storage.bolt.cursor.count | Number of cursors created. | cursors | gauge |
vault.raft_storage.bolt.node.count | Number of node allocations. | nodes | gauge |
vault.raft_storage.bolt.node.dereferences | Number of node dereferences. | dereferences | gauge |
vault.raft_storage.bolt.rebalance.count | Number of node rebalances. | rebalances | gauge |
vault.raft_storage.bolt.rebalance.time | Time taken rebalancing. | ms | summary |
vault.raft_storage.bolt.split.count | Number of nodes split. | nodes | gauge |
vault.raft_storage.bolt.spill.count | Number of nodes spilled. | nodes | gauge |
vault.raft_storage.bolt.spill.time | Time taken spilling. | ms | summary |
vault.raft_storage.bolt.write.count | Number of writes performed. | writes | gauge |
vault.raft_storage.bolt.write.time | Time taken writing to disk. | ms | summary |
Integrated Storage (Raft) Autopilot
Metric | Description | Unit | Type |
---|---|---|---|
vault.autopilot.node.healthy | Set to 1 if the node_id is deemed healthy by Autopilot, 0 if not | bool | gauge |
vault.autopilot.healthy | Set to 1 if Autopilot considers all nodes healthy | bool | gauge |
vault.autopilot.failure_tolerance | How many nodes can be lost while maintaining quorum, i.e. number of healthy nodes in excess of quorum | nodes | gauge |
Since Autopilot runs only on the active node, these metrics are only emitted by the active node.
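The description of `vault.autopilot.failure_tolerance` ("healthy nodes in excess of quorum") can be illustrated with a little arithmetic. The following is a minimal sketch, assuming a standard Raft majority quorum of `voters // 2 + 1`; the function names are hypothetical, and Autopilot's actual computation may account for details not modeled here.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of a Raft cluster with the given number of voters."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def failure_tolerance(voters: int, healthy_voters: int) -> int:
    """Healthy voters in excess of quorum, clamped at zero (an assumption)."""
    return max(healthy_voters - quorum_size(voters), 0)


# Print the tolerance for a few illustrative cluster shapes.
for voters, healthy in [(3, 3), (5, 5), (5, 4), (5, 3)]:
    print(voters, healthy, failure_tolerance(voters, healthy))
```

Under these assumptions, a five-voter cluster with all voters healthy can lose two nodes and keep quorum, while the same cluster with only three healthy voters has no remaining margin.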
Integrated Storage (Raft) Leadership Changes
Metric | Description | Unit | Type |
---|---|---|---|
vault.raft.leader.lastContact | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease | ms | summary |
vault.raft.state.candidate | Increments whenever raft server starts an election | Elections | counter |
vault.raft.state.leader | Increments whenever raft server becomes a leader | Leaders | counter |
Why they're important: Normally, your raft cluster should have a stable leader. If there are frequent elections or leadership changes, it would likely indicate network issues between the raft nodes, or that the raft servers themselves are unable to keep up with the load.
What to look for: For a healthy cluster, you're looking for a lastContact lower than 200ms, leader > 0 and candidate == 0. Deviations from this might indicate flapping leadership.
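The thresholds above can be turned into a simple check. This is a minimal sketch, not a recommended alerting rule: the `RaftLeadershipSample` container and the `leadership_warnings` function are hypothetical, and it assumes your metrics pipeline has already reduced each metric to a single value per evaluation window (for example, the mean or maximum of the `lastContact` summary).

```python
from dataclasses import dataclass

# Threshold taken from the guidance above; everything else is illustrative.
LAST_CONTACT_MAX_MS = 200.0


@dataclass
class RaftLeadershipSample:
    """Leadership metrics for one evaluation window (hypothetical container)."""
    last_contact_ms: float  # vault.raft.leader.lastContact
    leader_count: int       # vault.raft.state.leader
    candidate_count: int    # vault.raft.state.candidate


def leadership_warnings(sample: RaftLeadershipSample) -> list:
    """Return one message for each guideline the sample deviates from."""
    warnings = []
    if sample.last_contact_ms >= LAST_CONTACT_MAX_MS:
        warnings.append(f"lastContact is {sample.last_contact_ms:.0f}ms (expected < 200ms)")
    if sample.leader_count <= 0:
        warnings.append("leader is 0 (expected > 0)")
    if sample.candidate_count != 0:
        warnings.append(f"candidate is {sample.candidate_count} (expected 0)")
    return warnings


print(leadership_warnings(RaftLeadershipSample(35.0, 1, 0)))
print(leadership_warnings(RaftLeadershipSample(250.0, 1, 2)))
```

An empty list means the sample matches the healthy-cluster guidance; any returned message is a prompt to investigate possible leadership flapping rather than proof of a fault.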
Integrated Storage (Raft) Automated Snapshots
These metrics relate to the Enterprise feature Raft Automated Snapshots.
Metric | Description | Unit | Type |
---|---|---|---|
vault.autosnapshots.total.snapshot.size | For storage_type=local, space on disk used by saved snapshots | bytes | gauge |
vault.autosnapshots.percent.maxspace.used | For storage_type=local, percent used of maximum allocated space | percentage | gauge |
vault.autosnapshots.save.errors | Increments whenever an error occurs trying to save a snapshot | n/a | counter |
vault.autosnapshots.save.duration | Measures the time taken saving a snapshot | ms | summary |
vault.autosnapshots.last.success.time | Epoch time (seconds since 1970/01/01) of last successful snapshot save | n/a | gauge |
vault.autosnapshots.snapshot.size | Measures the size in bytes of snapshots | bytes | summary |
vault.autosnapshots.rotate.duration | Measures the time taken to rotate (i.e. delete) old snapshots to satisfy configured retention | ms | summary |
vault.autosnapshots.snapshots.in.storage | Number of snapshots in storage | n/a | gauge |
Metric Labels
Label | Description | Example |
---|---|---|
auth_method | Auth method type. | userpass |
cluster | The cluster name from which the metric originated; set in the configuration file, or automatically generated when a cluster is created. | vault-cluster-d54ad07 |
creation_ttl | Time-to-live value assigned to a token or lease at creation. This value is rounded up to the next-highest bucket; the available buckets are 1m, 10m, 20m, 1h, 2h, 1d, 2d, 7d, and 30d. Any longer TTL is assigned the value +Inf. | 7d |
mount_point | Path at which an auth method or secret engine is mounted. | auth/userpass/ |
namespace | A namespace path, or root for the root namespace | ns1 |
policy | A single named policy | default |
secret_engine | The secret engine type. | aws |
token_type | Identifies whether the token is a batch token or a service token. | service |
peer_id | Unique identifier of a raft peer. | node-1 |
node_id | Unique identifier of a raft peer, same as peer_id. | node-1 |
snapshot_config_name | For automated snapshots, the name of the configuration | config1 |
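The bucketing described for the `creation_ttl` label can be sketched as follows. This is an illustration of the rounding rule only, not Vault's implementation: the function name is hypothetical, and it assumes a TTL exactly equal to a bucket boundary falls into that bucket.

```python
from bisect import bisect_left

# Bucket labels from the table above, paired with their length in seconds.
_BUCKETS = [
    ("1m", 60), ("10m", 600), ("20m", 1200),
    ("1h", 3600), ("2h", 7200),
    ("1d", 86400), ("2d", 172800), ("7d", 604800), ("30d", 2592000),
]
_BOUNDS = [seconds for _, seconds in _BUCKETS]


def creation_ttl_label(ttl_seconds: float) -> str:
    """Round a TTL up to the next-highest bucket, or +Inf past the last one."""
    # Index of the first bucket whose bound is >= the TTL.
    index = bisect_left(_BOUNDS, ttl_seconds)
    if index == len(_BUCKETS):
        return "+Inf"
    return _BUCKETS[index][0]


print(creation_ttl_label(300))         # a 5-minute TTL
print(creation_ttl_label(3 * 86400))   # a 3-day TTL
print(creation_ttl_label(31 * 86400))  # a 31-day TTL
```

Because the label takes only ten distinct values, it keeps metric cardinality bounded no matter how many different TTLs are issued.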