Health Checks
Humio exposes information about its own health through the Health Check API. This feature is still in development and will continue to change; please check the documentation and release notes for updates.
The overall health of a Humio node can be in one of three states: OK, WARN, and DOWN.

- OK — All health checks are within normal operational parameters.
- WARN — At least one health check is in a WARN state, and an operator should look into why that is.
- DOWN — At least one health check is in a DOWN state. The node is down and not working. With dynamic load balancers, nodes that are down should be removed from the active set of nodes (see the sketch below).
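As an illustration of how an external system might consume these states, the following Python sketch polls a node's health information and decides whether to keep the node in a load balancer's active set. The endpoint path (/api/v1/health) and the response shape (JSON with an overall status field) are assumptions for this sketch; consult the Health Check API documentation for your Humio version.

```python
import requests  # third-party HTTP client: pip install requests

# Assumption: the node exposes its overall health at /api/v1/health and the
# response is JSON containing an overall status of OK, WARN, or DOWN.
HEALTH_URL = "http://humio-node.example.com:8080/api/v1/health"

def node_is_usable(url: str = HEALTH_URL) -> bool:
    """Return True if the node should stay in the load balancer's active set."""
    try:
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        status = resp.json().get("status", "DOWN")
    except requests.RequestException:
        # An unreachable node is treated the same as DOWN.
        return False

    if status == "OK":
        return True
    if status == "WARN":
        # Still serving traffic, but an operator should investigate.
        print("WARNING: at least one health check is in a WARN state")
        return True
    # DOWN: the node is not working and should be removed from rotation.
    return False
```

A dynamic load balancer could run a probe like this periodically and drop any node for which it returns False.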
Part of the overall health state is the uptime value. When a node has just started, there is a grace period where the overall state will always be WARN, even if some checks are DOWN. This gives the system time to become stable, but also indicates that things might fluctuate a bit. The grace period can be set with the configuration parameter HEALTH_CHECK_GRACE_PERIOD_SEC. The default value is 30.
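As a minimal sketch of how a monitoring script might account for the grace period, the snippet below suppresses alerts on WARN while a node's uptime is still within HEALTH_CHECK_GRACE_PERIOD_SEC. Reading the grace period from the monitor's own environment and passing in the status and uptime values from the health response are assumptions made for illustration.

```python
import os

# Default grace period of 30 seconds, matching HEALTH_CHECK_GRACE_PERIOD_SEC.
GRACE_PERIOD_SEC = int(os.environ.get("HEALTH_CHECK_GRACE_PERIOD_SEC", "30"))

def should_alert(status: str, uptime_sec: float) -> bool:
    """Decide whether a non-OK state warrants paging an operator."""
    if status == "OK":
        return False
    if status == "WARN" and uptime_sec < GRACE_PERIOD_SEC:
        # The node just started; WARN is expected while checks stabilize.
        return False
    return True
```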
Whether a given situation affects the health of a node can be specific to each installation. Therefore, most checks have configuration parameters for controlling thresholds. See the reference below; a short sketch of the threshold semantics follows the table.
Health Check Reference
Health Check Name | Description |
---|---|
backup-disk-usage | Percent used of the backup disk (only present when a backup disk is used). Will be WARN if used % >= HEALTH_CHECK__BACKUP_DISK_USAGE__WARN_THRESHOLD_SEC (defaults to 90). Humio (and Kafka) will crash if data directories run out of free disk space; recovering demands manual intervention and will quite possibly lose data in transit. |
event-latency-p99 | Ingest latency (99th percentile). This latency is measured from the time an event is received by Humio until the digest phase is done processing that event (running live searches and persisting to disk). Humio will take corrective measures when ingest latency grows, striving towards low ingest latency at the expense of searches if necessary. The Query Monitor on the administrative page can be used to find and disable problematic queries if this happens. (If you experience overload situations, we are very interested in working with you to fix the situation and improve the resiliency of Humio going forward.) |
failing-http-status-checks | Number of failed HTTP status checks within the last 60 seconds. This check will return a WARN status if the count is 1 or more. Note that this check might count the same host multiple times within the same time interval. Nodes in Humio use HTTPS when running searches and can only work if all (digest and storage) nodes can reach each other. Note that when a node dies, both this health check and missing-nodes-via-kafka will eventually fail, but this might not happen at the same time, as the checks run asynchronously. |
missing-nodes-via-kafka | Number of nodes not heard from via Kafka within the last 90 seconds. This check will return a WARN status if the node count is 1 or more. Kafka is used to ship ingest data around the cluster and for sharing global metadata. A node not present in Kafka is unavailable. As long as the number of missing nodes is fewer than the replication factor, this should not affect Humio, but an operator should immediately take corrective measures to get all nodes back up and running. |
primary-disk-usage | Percent used of the primary disk. Will be WARN if used % >= HEALTH_CHECK__PRIMARY_DISK_USAGE__WARN_THRESHOLD_SEC (defaults to 90). Humio (and Kafka) will crash if data directories run out of free disk space; recovering demands manual intervention and will quite possibly lose data in transit. |
secondary-disk-usage | Percent used of the secondary disk (only present when a secondary disk is used). Will be WARN if used % >= HEALTH_CHECK__SECONDARY_DISK_USAGE__WARN_THRESHOLD_SEC (defaults to 90). Humio (and Kafka) will crash if data directories run out of free disk space; recovering demands manual intervention and will quite possibly lose data in transit. |
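To make the threshold semantics concrete, here is a minimal sketch of the disk-usage comparison described above: a check goes to WARN once the used percentage reaches the configured threshold (default 90). Reading the threshold from an environment variable with the name from the table, computing usage with Python's shutil, and the example data directory path are assumptions for illustration, not how Humio implements the check internally.

```python
import os
import shutil

# Default threshold of 90 (percent used), as described in the reference above.
PRIMARY_THRESHOLD = int(
    os.environ.get("HEALTH_CHECK__PRIMARY_DISK_USAGE__WARN_THRESHOLD_SEC", "90")
)

def disk_usage_state(path: str, warn_threshold: int = PRIMARY_THRESHOLD) -> str:
    """Return OK or WARN depending on how full the disk holding `path` is."""
    usage = shutil.disk_usage(path)
    used_percent = 100 * usage.used / usage.total
    return "WARN" if used_percent >= warn_threshold else "OK"

print(disk_usage_state("/data/humio-data"))  # hypothetical data directory
```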